Final Report: Data Acquisition and Visualization through Embodied Sensors - Group 5

Students:
Gino Althof 1267299
Rutger Bell 1244241
Niek van den Berk 1234269
Mathias Verheijden 1234306
Gerben Vogelaar 1016370

Introduction


For the course ‘DASU20 - Data Acquisition and Visualization through Embodied Sensors’ the goal was to “Design personalized and data-intense products that support individual recreational sporters in both the motivational as well as in the physical aspects of enjoying their sport”, as stated in the lecture slides of week 1. Specifically, this course asked us to explore the possibilities of acquiring data from sporters and representing it back to them. To accomplish this goal, GPX data files were received, which the group could transform and use to derive additional useful data. This data would then be used to create a meaningful visualization, which could be presented to recreational runners to motivate them. In order to know what information would be useful for a specific recreational sporter, a fictional sporter profile was created by means of a persona. This persona was used throughout the whole course to check whether the data taken from the GPX files would be useful for this specific persona and, in a broader scope, for a specific group of recreational runners. The ‘meaningfulness’ of the visualization was also determined by the persona and the type of recreational runner that was examined. This report describes the reasons behind the choice for the specific type of recreational runner, the persona, and the process and steps taken to choose the right data and visualization for this specific group of recreational runners. The F.A.I.R. principles for data use are described in this report and were kept in mind throughout the whole process.

Exploring the datasets

The project started with two preliminary assignments. For the first one, a recreational runner from the 2016 marathon was picked and analysed, which is not important here. The second assignment, however, was about exploring the given datasets. These were GPX files from a recreational runner who had run 135 times over the course of six months. The files contained different parameters. First of all, his position was given as GPS coordinates during his whole run. In addition, there were some variables listed over time: elevation, speed, heart rate and external temperature. See figure 1.

While exploring these GPX files, a couple of interesting facts came up right away. To start off, most of the runs took place in Eindhoven, but not all of them. Some were in the north of Italy or in Belgium. Since these were all around July and August, a logical explanation could be that he was on vacation during that period or was there for training purposes. We chose not to use that data because we wanted to focus on a more coherent set of data. The next big thing that came up was the huge difference in the length of the runs. These lengths varied from less than 700 m to 20.9 km, which could indicate a difference between recreational, training and competition runs. Using the information about the elevation of the runs, a clear difference in length could be seen between the runs in the mountains and those in Eindhoven. This was, however, not the only explanation for the huge length differences. The long runs could be assumed to be races, since they were also run in (slightly) different areas, such as the centre of Eindhoven and a run in Belgium.

Profile selection

For assignment 1, a few different recreational runners from different categories were examined. It was found that many runners running the Eindhoven marathon were either individual competitive runners or social competitive runners. For assignment 3, however, the group eventually settled on the profile of an individual fitness runner. Rather than being event oriented, individual fitness runners run much more to de-stress, keep in good shape or recover from a trauma, which is much more focused on the psychological aspect of running. Individual fitness runners thus often do not have a specific goal to run for. One of the pitfalls for many individual fitness runners is therefore boredom: as a result, these runners run the same route over and over again. This was seen as a challenge, as these runners would be hard to motivate by just providing numbers and boring graphs. For this reason, a persona was created to thoroughly explore the profile of individual fitness runners and discover more motivations and frustrations.

Thijs Hermans | 38 years old | Married | Individual fitness runner

Photo: www.uxpressia.com (F.A.I.R.: this photo may be used for persona creation purposes, and the person shown in the photo has given consent to use it for these purposes via www.uxpressia.com)

Goals:

● To run the half marathon of Eindhoven again
● Understand more about what is important when running longer distances
● Lose a bit more weight, but overall get in better shape and in better condition
● Get in shape without any help from other people
● Try to keep a consistent running schedule

Background: Thijs Hermans is 38 years old and was born and raised in Eindhoven. He knows Eindhoven like the back of his hand and has many friends and family living there. He has never been much of a sports person. He has played football in a team of friends, but this was more for fun than for competition. Thijs likes to go to bars and pubs once in a while, and going on a night out is also part of his regular lifestyle. However, one of his friends, Bas, challenged him last year to run the Eindhoven Marathon. He accepted the challenge and started running on a regular basis in order to train for it. In 2017 he trained for three months straight until the Eindhoven Marathon took place. At first he found it very difficult to keep on track with his friend's schedule, so after a few weeks he decided to start running alone. He felt quite some motivation and pressure to train, as he wanted to keep up with Bas during the marathon. He also felt that running became a de-stressing activity. After the Eindhoven Marathon of 2017, Thijs thought he would stop running. However, the marathon did not go as planned: he couldn't keep up with his friend Bas, which gave him the determination to go on with running. He kept running alone, as he thought this was the best way for him to stay motivated and make progress at his own pace. In 2018 he got a fitness tracker that recorded all of his runs as GPX data. He tried to look into the data, as he thought it would be interesting to see how his runs were shaped. He found the data quite difficult to follow and was not motivated by it as much as he had hoped. As 2018 progressed, Thijs got less and less motivated to run, as he found it hard to find time and saw less improvement than he initially wanted. He also saw himself running the same routes over and over again, which demotivated him even further. He noticed he was undertraining and is still trying to find new motivation to keep running.

Motivations:

● Health/Condition
● Friends
● Clear data
● De-stressing
● Running for fun

Frustrations:

● Boredom (same routes)
● Difficult to read data
● Less progress than initially thought
● Time management
● Undertraining

As one might notice, recovery from a trauma was left out of this persona, as this would influence our individual fitness runner in a way beyond our scope. With this persona we could focus on providing the right motivation through the right data, clearly visualized. One running risk that can be deduced from this persona description is undertraining: Thijs Hermans was motivated before, but because he no longer has a clear goal, he has a hard time keeping to a good running schedule. He is, however, interested in data, so providing the right amount of data could greatly help him regain motivation and thus a good running scheme.

User perspective

The goal of the project is to give feedback, in the form of advice, to a recreational runner about his or her performance, fully based on data. The first important point to consider is who receives this feedback. Firstly, the feedback could be presented to the sporter him- or herself. However, one could also choose to provide the information to the sporter's coach or personal trainer.


In this project, it was chosen to present the feedback directly to the sporter, for a couple of reasons. First of all, this project deals with recreational runners, and a large percentage of this type of runner does not have a coach but runs alone. Presenting data through a coach would therefore not target the right audience in this case. Furthermore, when the feedback is presented to the runner directly, he or she is able to respond immediately. For example, when the runner starts off at too high a pace for his or her skill level, a feedback device could immediately tell the runner to slow down. A coach standing at the start of a lap could only tell the runner when the runner passes by, and a coach running along would have to communicate the data appropriately. This communication would be an extra step in which information could get lost or be interpreted differently. By presenting the feedback directly to the sporter, this effect is minimized. On the other hand, there are also downsides to this decision. While running, a runner is not able to fully analyze his or her own technique, so it is hard for the runner to see whether he or she improved by using the feedback; a coach would have been able to see this, since he or she is observing. Next to this, less experienced runners could face the problem of not understanding or not being able to use the feedback. This could be due to a lack of knowledge about the terminology used in the feedback or because, even though they got the feedback, they are not able to recognize their problems. A good coach has extensive knowledge of running and is able to make clear to the runner what his or her problem is, and could therefore use the gained feedback properly. Additionally, a coach could provide motivation.

Data analysis

Provided data

The goal of the data analysis was to gather useful information from the provided data files. These files are in GPX format; the creator metadata shows that the product used was the Garmin Connect app version 1.1, which parsed the data collected by a wearable into the GPX format. This format consists of the standard GPX format together with an extension: ns3. However, the provided web address that should describe the extension data no longer exists, which means that we had to make an educated guess about what the trackpointExtension types atemp, hr and cad mean. We guessed that atemp means atmospheric temperature, hr means heart rate and cad means cadence. In short, the data present in the files is (metadata):

Metadata

latitude (lat)

longitude (lon)

elevation (ele)

time (ISO 8601 timestamp: YYYY-MM-DDThh:mm:ss.sssZ)

atmospheric temperature (atemp)

heart rate (hr)

cadence (cad)

Data transformation


Because of the F.A.I.R. principles, we decided to process the data ourselves without the use of any external applications. As a group we decided that it was easier for us to parse CSV files than the given GPX files. Because of this, we wrote a GPX parser that generates CSV files containing the data specified in the provided data section above. The first row consists of the metadata (column names) of the values; the following rows consist of the data for a specific time point, in chronological order.
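As an illustration, the sketch below shows how one of these generated CSV files could be read back for analysis. It is a minimal sketch only: the file name results0.csv is hypothetical, and the semicolon-separated column layout follows the parser listed in the appendix.

import pandas as pd

# The parser in the appendix writes semicolon-separated columns:
# lat; lon; ele; time; atemp; hr; cad
run_df = pd.read_csv('results0.csv', sep=';')  # hypothetical file name

# Missing sensor values are stored as the string 'NONE'; make the heart rate
# column numeric so it can be used in calculations directly
run_df['hr'] = pd.to_numeric(run_df['hr'], errors='coerce')

print(run_df.head())  # one row per recorded second, in chronological order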

Run visualization

run      GPX file (activity) / CSV file
run 1    activity_2619648105 / run1.csv
run 2    activity_2629224630 / run3.csv
run 3    activity_2632478951 / run5.csv
run 4    activity_2631381560 / run4.csv

Runner-specific data selection

Since our persona is an individual fitness runner, his main concern is staying fit. This concern expresses itself in having a routine, a weekly schedule so to say. A major part of this routine are the routes our persona runs. Looking at the data, it was found that the majority of the runs take place in Eindhoven. To be even more specific, of those routes in Eindhoven, a rather big part is run in the north of Eindhoven, around the neighbourhood of Blixembosch, see figure 3.

Figure 3: a map of all the runs that were run in Eindhoven.

To eventually give proper advice to our persona, it was chosen to only take the runs in Eindhoven into consideration. All runs in Eindhoven were used instead of only those around Blixembosch, because too large a part of the persona's training schedule would get lost if only the North-Eindhoven runs were picked. In addition, for some runs it is hard to tell whether they would be picked, since these were run around the borders of Blixembosch. For an individual fitness runner, only some parts of the data from the runs are important. The persona wants to maintain his level of fitness. Therefore it is important, first of all, that his speed is consistent during his runs. If his speed goes down too much during a run, that can mean one of two things. Firstly, it could be the case that he is holding back; in that case, the runner does not really ask the best from himself and therefore does not really train. Secondly, a significantly dropping speed could mean that the persona started too quickly, in which case ‘he blew himself up’. This leads to overtraining and is therefore bad for his running endurance.

Secondly, the distance of the run is an important parameter concerning the fitness of the persona. Distances that are too short may lead to undertraining, while running distances that are too long are not good for the persona's endurance either. A sensible, responsible running distance is therefore vital to maintain fitness.

More of a technical parameter is heart rate consistency. Similar to speed consistency, a balanced heart rate is crucial to train well. It is important to measure both parameters: speed can be easily understood by everyone, unlike heart rate, so everyone can understand feedback based on speed data. However, heart rate is important to measure because it shows changes much faster. For example, if the speed goes up by one or two kilometres per hour for five minutes, the runner might not sense that he cannot maintain this speed until the end of his run; heart rate, however, gives this away. When the persona is accelerating, his heart rate will go up significantly, which can alert the runner that he might eventually overexert himself.

The last important parameter that needs to be considered for the individual fitness runner is the frequency of his runs. This may sound trivial, since too few training sessions a week will lead to undertraining and too many sessions a week will certainly lead to overtraining, neither of which is good for the runner.

All parameters that were chosen are independent of the running skill level of the persona. This means that the advice based on these parameters is valuable for every runner, whether he or she is quick or not.

Analysis of runner-specific benchmarks

Heart rate consistency

For the heart rate we decided to analyse the consistency of the runner. Due to inconsistencies in the data and possible elevation in the runs, we needed to normalize the data. For each run we collected every heart rate value and created a visualization of these data points; we chose a boxplot for this visualization. The boxplot gives us the ability to filter out most of the inconsistent data. For a given boxplot, we calculate the range between the first and third quartile, which we call the heart rate range. This takes into consideration that there might be some elevation in the run and possible faulty heart rate measurements by the wearable.

We define ranges for possible good and bad heart rate ranges and quantify them with a textual description. Based on the heart rate ranges from all the runs, it is possible to give a textual qualifier to a given heart rate range. All the heart rate ranges can be found at the end of this chapter. These ranges are open for discussion and should be further investigated to create better and more accurate ranges; this is discussed further in the discussion section. The thresholds used are:

Heart rate range (Q3 - Q1)   Qualifier
0-15                         Good
15-20                        Sufficient
20-25                        Inconsistent
>25                          Very inconsistent
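A minimal sketch of this classification is given below. It assumes the heart rate samples of a single run as a plain list; the quartile computation mirrors the appendix code, and the thresholds are the ones from the table above. The function name is illustrative.

import numpy as np

def heart_rate_range_label(heart_rates):
    # The "heart rate range" of a run is the difference between the third
    # and first quartile of its heart rate samples (Q3 - Q1)
    q1, q3 = np.percentile(heart_rates, [25, 75])
    hr_range = q3 - q1
    if hr_range < 15:
        return hr_range, 'good'
    if hr_range < 20:
        return hr_range, 'sufficient'
    if hr_range < 25:
        return hr_range, 'inconsistent'
    return hr_range, 'very inconsistent'

# Made-up heart rate samples, for illustration only
print(heart_rate_range_label([142, 150, 155, 158, 160, 161, 163, 170]))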

As stated above, we created boxplots for the given runs.


By calculating the difference between the first and third quartile, we generate the following results:

         Heart rate range   Result
run 1    20                 Inconsistent
run 2    28                 Very inconsistent
run 3    20                 Inconsistent
run 4    11                 Good

From the results given in the table above, we can see that there is quite a difference in heart rate range between the runs. A general conclusion about the runner's heart rate would be that it should be more consistent: the runner should try to keep a steadier pace to ensure a more consistent heart rate.

The heart rate ranges (Q3 - Q1) of all runs, in no particular order:

[19.0, 20.0, 11.75, 15.0, 10.0, 25.0, 19.0, 12.0, 11.0, 31.0, 12.0, 13.0, 26.0, 6.0, 11.0, 9.0, 10.0, 22.0, 11.0, 16.0, 16.0, 13.0, 12.0, 32.0, 30.0, 9.0, 14.0, 21.0, 18.0, 11.0, 17.0, 26.0, 9.0, 16.0, 31.0, 20.0, 13.0, 9.0, 25.0, 14.0, 17.0, 19.0, 19.0, 24.0, 13.0, 18.0, 11.0, 40.0, 15.0, 14.0, 24.0, 11.0, 10.0, 30.0, 14.0, 26.0, 28.0, 8.0, 12.0, 10.0, 28.0, 20.0, 26.0, 15.0, 9.0, 16.0, 10.0, 11.0, 16.0, 9.0, 20.0, 23.0, 18.0, 15.0, 17.0, 18.0, 10.0, 8.0, 14.0, 35.0, 34.0, 15.0, 14.0, 16.0, 13.0, 22.0, 6.0, 12.0, 11.0, 10.0, 33.0, 12.0, 9.0, 9.0, 10.0, 6.0, 24.0, 18.0, 25.0, 33.0, 8.0, 8.0, 17.0, 13.0, 11.0, 24.0, 9.0, 16.0, 15.0, 35.0]


This does not take into consideration runs with a lot of elevation or interval training, but our target fitness runner is a runner who is consistent in his runs and runs at a steady pace with a consistent output. It also does not take into account the switch from glucose to fat as an energy source during marathons.

Speed

The speed consistency of a long-distance runner can indicate whether the runner started too fast, in which case he will be exhausted sooner, resulting in a worse time, or too slowly, in which case he could have started faster for a better overall time, since a slow start indicates that the runner has plenty of energy left. The calculations were based on the time frames and the differences in latitude and longitude. Each data point was recorded at a one-second interval. To calculate the distance between two pairs of latitude and longitude positions, we used the haversine function, which takes the curvature of the earth into consideration, to get a distance in metres. By dividing the distance by the time, the speed is calculated in m/s. The visualization used for the speed consistency is a scatter plot. The scatter plot consists of every average speed between two consecutive latitude/longitude positions as described above, plotted against the corresponding time. From the input of the scatter plot, we fit a trend of the average speed. This trend, in combination with the scatter plot, is used to give a recommendation to the user based on the slope of the user's run. Below, the four runs are analyzed and a trend is given; the red line shows the trend of the run.
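A minimal sketch of this speed computation is shown below. It assumes per-second latitude/longitude samples; the haversine helper mirrors the one in the appendix, and a simple linear fit stands in for the red trend line in the plots. The function names and example coordinates are illustrative.

import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    # Great-circle distance in metres between points given in decimal degrees
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a)) * 1000

def speed_trend(lats, lons):
    # Consecutive samples are one second apart, so the distance between two
    # consecutive points in metres equals the speed in m/s for that second
    speeds = haversine(np.array(lons[:-1]), np.array(lats[:-1]),
                       np.array(lons[1:]), np.array(lats[1:]))
    t = np.arange(len(speeds))
    slope, intercept = np.polyfit(t, speeds, 1)  # the linear trend (red line)
    return speeds, slope

# Made-up coordinates, for illustration only
lats = [51.4500, 51.4501, 51.4503, 51.4506, 51.4508]
lons = [5.4700, 5.4701, 5.4702, 5.4703, 5.4704]
speeds, slope = speed_trend(lats, lons)
print(speeds, slope)  # a negative slope suggests a declining pace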

From the trends and the progression of the run, we formulated results which are shown in the table below.


Run 1
Observation: fast speed in the beginning, then a dip; starts to speed up again.
Result: the runner needs to start slower.

Run 2
Observation: there is a big speed decrease in the beginning; after that it slowly goes up.
Result: the starting speed is too high; the runner should start slower.

Run 3
Observation: the speed keeps increasing; it drops a little in the beginning.
Result: the runner could have started faster.

Run 4
Observation: the speed fluctuates a little in the beginning; the speed keeps increasing.
Result: the starting speed could have been a little lower.

To conclude, the slope of a given plot can help determine whether there is a positive or negative trend in the user's run. However, we cannot draw a conclusion solely from the trend of the graph. For example, in run 1 the trend is upwards, which would suggest that the user should start faster; the graph, however, shows that the user already starts far too fast. This means we also have to supply the user with a good visual of the run to give a better recommendation. We therefore want to give the user an abstract version of the graph as an easy-to-read indication of the run, which helps to interpret the recommendation.

Distance

The third criterion was to prevent over- or undertraining. The distance of each run is calculated and compared to the runner's normal schedule. Using the same method as used to calculate the distance in the previous paragraph, it was possible to determine the length of each run and compare it to the average for that day of the week. The table below shows the distance of each run, what the runner normally runs on that day, and the result shown to the user. Boxplots were used to obtain a range of distances that can be used to determine whether the distance of the current day is normal.


For now, we use a 10% margin as an educated guess for what a good run distance is compared to the typical distance for that day. This value may change in the future based on further research.

         Distance (km)   Average distance that day (km)   Result
run 1    16.7             12.9                             Too much
run 2    13.8             10.5                             Too much
run 3    4.2              3.2                              Too much
run 4    3.9              3.3                              Too much

These results show that these runs are all above the average distance. This indicates that the runner does too much and is at risk of overtraining. Note that the averages are calculated from all runs in the dataset, whereas runs 1-4 are the first runs of the dataset, which suggests that the user started out running too much in the first week(s).
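A minimal sketch of this comparison is given below, assuming distances in kilometres and the 10% margin mentioned above; the function name and the 'Too little' label are illustrative additions.

def distance_advice(run_distance_km, day_average_km, margin=0.10):
    # Compare a run against the typical distance for that weekday,
    # using the 10% margin from the text as a rough threshold
    if run_distance_km > day_average_km * (1 + margin):
        return 'Too much'
    if run_distance_km < day_average_km * (1 - margin):
        return 'Too little'
    return 'Normal'

# Values from the table above (run 1): 16.7 km against a 12.9 km day average
print(distance_advice(16.7, 12.9))  # -> 'Too much'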

Frequency

It was also important to let the runner know when he did not meet his weekly schedule. Below are the numbers of runs per week. Since the frequency is not calculated per run but per week, it is not possible to get this data per run; the frequency is therefore updated weekly.

Below, we show the weekly frequency over the dataset, where item Freq[week] equals the number of runs in that week. Looking at the results, you can see that the runner's schedule has him running too many times a week; sometimes there are even two runs a day. This makes the runner susceptible to overtraining.

Freq = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 7, 8, 6, 6, 5, 1, 3, 0, 0, 0, 6, 4, 6, 5, 6, 7, 7, 7, 6, 7, 6, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
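The sketch below shows how such weekly counts could be turned into a simple warning. The threshold of five runs per week is a hypothetical value chosen for illustration, not one derived in this report.

freq = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 7, 8, 6, 6, 5, 1, 3, 0, 0, 0,
        6, 4, 6, 5, 6, 7, 7, 7, 6, 7, 6, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

MAX_RUNS_PER_WEEK = 5  # hypothetical upper bound for a fitness runner

# List the (zero-based) week numbers in which the runner trained more often
# than the assumed maximum, i.e. weeks with an elevated overtraining risk
busy_weeks = [week for week, runs in enumerate(freq) if runs > MAX_RUNS_PER_WEEK]
print(busy_weeks)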

Ethics

There is a lot of data available in these datasets, and some of it could be used to harm the runner in some way. This may seem far-fetched, but with the GPS data and time alone it is possible to know when the runner is away from his house and where he is, which is an invasion of the runner's privacy. That is why it is important to know which data and graphs can be shared with others. Heart rate should not be shared with insurance companies, since they could use it to assess how healthy their clients are and change their coverage. Speed can be shared if the runner is not participating in competitions, since speed alone is only useful to the runner. However, there might be other implications of this data that can harm the user, so we should still make the security of this data, together with the previously mentioned data, a priority.


The distance of a run is not harmful on its own either, but combined with the days of the week it could be used to learn the runner's running schedule, which could, for example, let people know when a runner is not at home. Even without the metadata in the CSV file, it is not hard to work out what each data column means. It should therefore be a priority to only share the data with trusted people. These are only a few possible ways for a malicious person to misuse the data. This is why we only want to share the CSV files with people associated with the course “Data Acquisition and Visualization through Embodied Sensors”.

F.A.I.R.

We will look at the F.A.I.R. principles and discuss how we comply with them. The criteria referenced below can be found in the appendix.

To be findable

We make use of CSV files. The first row shows the metadata of the data. There is a clear distinction between the different identifiers, e.g. lat, lon, ele, time, atemp, hr, cad (F1). The identifiers are in line with the abbreviations of the widely used GPX format identifiers, so these identifiers should be clear to the target audience (F2). The identifiers do not always specify perfectly what the data represents; however, from the data itself it should be clear what it is about (F4). Each row represents a point in time in ascending order, which adheres to a searchable format (F3).

To be accessible

The data is retrievable through the structure present in the GPX files, since only identifiers present in the GPX files are used (A1). CSV is a widely used format (A1.1). The format does not adhere to the authentication and authorization procedure (A1.2). The metadata is also specified in this report, so even when the data is no longer available, the metadata can be accessed here (give the chapter) (A2).

To be interoperable

The data is presented in a formal fashion, the CSV format (I1), and uses the appropriate names for the data identifiers (I2). The GPX format has references to the metadata identifiers (I3).

To be re-usable

The data and its analysis are part of the course Making Sense of Sensors at the TU/e and are only shared with the teachers and students participating in the course. The course follows the guidelines present in the Dutch constitution (R1.1). The files are associated with this report, which gives a clear provenance of the data (R1.2). As previously stated, a CSV format is used, which is a widely used and accessible format (R1.3).

Data visualization

For a non-professional runner, it is very difficult to draw significant conclusions from a number-heavy interface. Also, for a non-professional runner, long-term focused feedback is not very useful, because it cannot reflect immediate improvement, which can be demotivating. For these reasons, the group aimed to visualize data representing run attributes that can be improved in a short amount of time, even from run to run. While doing this, as few numbers as possible were used to ensure feedback that can be easily interpreted. Therefore, the chosen data on speed, distance, frequency and heart rate were visualised in a simplified way, with shapes that relate to the specific category. Every visualization is then complemented by textual feedback.


Apart from the individual categories, the home screen of the ‘app’ showed a visualization of the user's running profile. The shape represents the four chosen running categories (in four directions) and distorts if there are issues with one or more of those categories. Therefore, the rounder the shape, the better the user's running profile. This would instantly reflect good and bad results to the user instead of confronting them with complicated numbers. Finally, each visualization would use a continuous, meaningful colour scheme. Using the colours blue, purple and orange, they respectively represented ‘good’, ‘average’ and ‘bad’, or ‘low’, ‘average’ and ‘high’, etc. By using this colour scheme in addition to the shapes, the meaning of the shapes was reinforced.
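As a rough illustration of this idea, the sketch below draws a four-direction profile shape with matplotlib. It is an approximation only: the actual prototype was built in Adobe XD (linked below), the category scores are made up, and a radar-style polygon stands in for the smoother shape described above.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores between 0 and 1 for the four categories; a perfect
# profile (all ones) would render as a fully symmetric shape
categories = ['speed', 'distance', 'frequency', 'heart rate']
scores = [0.9, 0.6, 0.4, 0.8]

# Repeat the first point to close the polygon
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(angles, values, color='tab:blue')
ax.fill(angles, values, color='tab:blue', alpha=0.3)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_yticklabels([])
ax.set_title('Running profile')
fig.savefig('running_profile.png')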

For a working version of this prototype, please visit: https://xd.adobe.com/view/0378472f-98d3-4e99-5511-4ddf2ebf3075-7549/?fullscreen


Discussion

Even though the final visualization was a good and clear representation for runners of how their run could be improved, there are still some discussion points for both the data transformation and the visualization. The first important discussion point is that the dataset used in this project is from a runner who is able to run a marathon and does so multiple times per year. The data from marathons outside Eindhoven was filtered out, but the number of runs per week, the average distance per day and especially the running experience are not representative of the individual fitness runner that was designed for during this project. Therefore, in a next iteration, if data were received that fits the profile better, additional findings about the measured parameters might emerge that would influence the final visualisation design. The same goes for the recommendations given in the app, which might not always fit the individual fitness runner's profile perfectly. Secondly, the speed of the runner could be calculated better if the location data were gathered more accurately, or if the velocity of the runner were recorded directly instead of being derived from location and time. In the current datasets the speed of the runner is based on the location and time of the runner, which means that when location data is invalid, the computed speed changes drastically. As a result, the data provided to the runner through the visualization could be inaccurate because of false GPS data points. In a next iteration of the process it should be investigated whether these false GPS data points significantly affect the outcome of the results shown in the speed graphs. The inaccuracy of the data is also present in data such as elevation (which was not used for the created visualization), which might affect other outcomes and suggestions as well. Lastly, a discussion point that has to be mentioned is that two of the four parts of the final visualization rely heavily on previous runs. Right now, the visualizations were created with six months of previous runs taken into consideration. When a runner would start running, or begin to use the app, the two parameters ‘average distance per run compared to past runs’ and ‘frequency of running’ would only become useful after a couple of weeks of running. An additional iteration would have to determine how to visualize these two parameters in an appropriate way without previous data present. Especially the parameter ‘average distance per run compared to past runs’ would suffer from a shortage of data, as the average run per day would change a lot if only a few past runs have been run. However, if more data is present, e.g. more than a year of past runs, recommendations could vastly improve and all four parameters would create a solid visualization of how good a run has been for the specific runner.

Conclusion

To conclude, we as a group created an actionable visualization which provides users with suggestions for their runs in four specific categories, selected based on the runner profile that was identified at the beginning of the process. The profile of an individual fitness runner was carefully selected and examined in order to filter the right datasets to create visualisations from. With the decision to examine four different aspects of running (heart rate, speed decline, frequency and distance per day of the week), it became important to create a clear, easy to grasp and user-friendly interface that showed how good a run was in an abstract way. This abstract, no-number way of visualization ensured that the individual fitness runner created earlier would be motivated by means of shape rather than numbers. The data that we used consisted of a lot of numbers; in the end, we tried to strip as many numbers as possible away from the graphs created through the data analysis and, as a result, created a clean and visually pleasing visualization. Based on the findings above, there are some important recommendations for the runner. First of all, it would be better for the runner not to run as often as he is doing now, since he sometimes even runs twice a day. As said before, our visualisation is actionable; therefore the recommendations for the runner are already present in the app we created. The recommendations in the app are based on the four parameters examined. For speed, runners can see whether their initial speed is too high and how they could improve their speed consistency over a whole run. Heart rate is linked to speed consistency and, if necessary, also recommends running more consistently. Frequency and distance per day, compared to the average distance for that day of the week, prevent over- or undertraining, as a recommendation to build up a consistent running scheme. The data acquired from the datasets could in the future be used to create more interactive and even more helpful apps for a wider variety of runners.


Personal reflections

Rutger Bell

During the preliminary assignments we chose as a group to all evaluate a runner and all explore the datasets. We chose this approach so that everyone would have equal and full preliminary knowledge for the main project. After those assignments, the main idea that I took home was that there is much more information one can get from some simple data than I expected. One was, for instance, able to track down a runner's full social life from just his/her name and his/her time in the Eindhoven marathon. Furthermore, using the GPX files, one could make a feasible estimate of where the traffic lights are placed throughout the city. For the main project, we decided to split up the tasks in such a way that everyone in our group was doing what he was good at. My group member studying software science, for example, took care that the data was readable, because he has the most experience with coding. Furthermore, one of my industrial design group members made the visualization, because he knows best how to do that. As a physics student, this project is somewhat different from what I normally do, so I assigned myself the task of mainly working on the report, since that requires the least background knowledge. I think that my contribution to the project is more or less equal to that of the rest of my group, since I put in equal effort and time. The main lesson I learned during this course, and which I can now apply during the rest of my studies, is that you can get so much information from so little data. This means that I will now dig deeper into datasets presented during the rest of my studies to extract all possible information from them. On a group-cooperation level I learned that tasks should be distributed logically to gain the best result.

Gino Althof

For this project, our group split up the tasks among the members, so everyone did something they were good at. Gerben and I programmed everything. I have followed the course data analysis before, but did not see the use of that course at the time. During this course I found out how data can be used to know the user and help them improve certain things. If I were to do this project again, I would have liked to learn something other than programming, since I already knew how to do that. Making visuals would have been better for me, because I do not do that often and would like to improve in this area. I helped the whole group with gathering data and tried to find connections between different kinds of information. For example, in the first assignment we saw a big, abnormal increase in speed, and quickly found out that it was caused by an underpass. I also learned that data can be used in multiple ways if combined with other data. For example, GPS location shows where someone runs; it can also be used to calculate the distance and, combined with time, to calculate speed. I did not expect to have a lot of variables to research, but quickly found out that the 6 columns present in the files can be used for at least 10 variables. I do not do any sports, but I learned a lot from this course about sports. Now I know it is possible to overtrain and undertrain. I also found out that not everything that is useful for a fitness runner is useful for a social runner, so it is important to know who your user is going to be and what is needed for that user. This data will be used in the next course and in other projects related to sports, if I do any in the future.

Gerben Vogelaar


The project started with a small assignment on the Eindhoven marathon, where we had to analyze a few runners who finished with slower times. This gave great insight into how, even with little data available for a runner, you can clearly identify multiple important aspects of the runner's philosophy; for example, it was possible to determine what drives the runner. I found this to be a great introduction to the course. For the main project, we had to analyze a dataset of a specific runner. The team was divided into multiple groups to make the most of everyone's expertise. But before this happened, we discussed the project together to form our goals for the coming weeks, and discussed further before each meeting. My part of the project, together with Gino Althof, was to transform and compute the required data from the available GPX files. Even though I am a bachelor Software Science student, this was a new experience for me, since I had no prior background in Python or GPX/CSV. This was a welcome new experience that I found quite enjoyable and that produced good results. The project taught me to work with data formats and data transformation and, most importantly, that even data with just a few parameters can give great insight into a user, which can be used both for good purposes, such as improving the user's health, and for bad ones, such as malicious data usage. As a sportsman myself, while being more into strength sports, I could see there are some similarities between a runner and someone practising strength sports: how one should slowly improve and be careful of undertraining (no progress) or overtraining (lots of fatigue or even injuries). I found that the course itself, regarding the sports-specific information, did not give me more information than I previously had. But how one can use data to move someone, mainly a novice user, towards a more active life and/or even a better performance was a new insight.

Niek van den Berk

For this course I feel like I have learned a lot about the overall design process with real data files. In previous projects I never used real datasets to design with. I found it really interesting to see how personas, data and visualizations all play into each other in creating an understandable outcome for a specific group of runners. During this course my main focus lay on creating the persona and providing a substantiation for why we chose ‘Individual fitness runners’ as our main focus. I had some previous experience with making personas, but this was the first time in a project I really felt it contributed to the whole process rather than to a part of it. Almost all further decisions in the course were based on the persona and the specification of the type of recreational runner we chose as a group. Therefore, it was very important for me to create a well-defined persona. In the end, I feel I gathered a lot more expertise in making personas than before, as I put more effort into making the persona than in previous projects. I am happy with this growth and I am planning to make more personas in future projects to get a better understanding of certain user groups and their needs. Additionally, I tried to help both Gerben and Gino with exploring the data files. I tried to get useful user data out of the GPX files as well, to use for the visualisation. However, I recognized that both Gino and Gerben had much more experience in coding in Python.
In the end, I therefore contributed to the decision-making process in terms of choosing which data to transform and which data to show in the visualization. I reckon I could have gotten more coding experience out of this course by actually coding and transforming the data myself, and I regret that I did not contribute to this part of the project as much as I would have liked. However, I do feel that I now understand how data transformation plays a key role in the design process. Furthermore, I now understand how real data such as GPS location, speed and heart rate can be used to create a motivational visualization for a recreational sporter that focuses on aspects such as undertraining, overtraining and a good running scheme based on speed decline/increase and heart rate range. In future projects I would like to do more coding myself, just as during the course ‘2IAB0 - Data analytics for Engineers’. However, I do feel that because of this course I became more proficient in choosing the right data to transform and encode.

Mathias Verheijden

For this course I feel I have learned a lot about a multidisciplinary work process. Before enrolling in this course, I expected to have trouble with coding the data. Therefore, I am happy that the multidisciplinary structure allowed me to apply the skills from my own area of expertise. The collaboration in this course helped me to better understand terms and values from different disciplines, which will be very useful for future projects. This experience allows me to better communicate and mediate within a multidisciplinary group. On the theoretical side, I learned to work with real data in an ethical way and according to the FAIR data principles, which was completely new to me. This, in combination with the assignment's end user in mind, resulted in a user-centered design process where multiple stakeholder values had to be accounted for. Since I am not good at the coding side of data analysis, I contributed to the decision process of how the data should be analyzed, which variables should be used, and how they should be visualized. This process taught me how to get a pre-analysis understanding of raw data and discover the possibilities in the data. This will save me a significant amount of time in similar future projects. In the end, we chose a data analysis suited for short-term user feedback. With my experience in industrial and graphic design, I was responsible for all the visualization towards the user in this course. Before this visualization started, however, I learned from my teammates about the technical aspects of data analysis. Although I was less involved in this part, the material discussed in the meetings gave me more insight into the possibilities of data analysis. Still, I feel I have not significantly developed myself in this area, but given that this was a multidisciplinary project, I focused on my own expertise. For the visualization, I had to transform pre-analyzed data into a suitable, user-friendly interface. This required me to develop a visualization that highlights the key points of the data, while smoothing out the insignificant details. This process taught me to understand the difference between significant and insignificant data and to apply this to the user's needs. It was the first time I made a complete data visualization from scratch, but I really think I have outdone myself. This visualization broadened my skill set in graphic design, something I use almost daily in my studies. Overall, this course contributed to the development of new skills and the improvement of pre-existing skills. My growth in my own area of expertise is something I appreciate, but I value the increased experience in the multidisciplinary work setting the most.

Appendix


To be Findable:

F1. (meta)data are assigned a globally unique and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.

To be accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
A1.1. the protocol is open, free, and universally implementable.
A1.2. the protocol allows for an authentication and authorization procedure, where necessary.
A2. metadata are accessible, even when the data are no longer available.

To be interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.

To be re-usable:

R1. meta(data) have a plurality of accurate and relevant attributes.
R1.1. (meta)data are released with a clear and accessible data usage license.
R1.2. (meta)data are associated with their provenance.
R1.3. (meta)data meet domain-relevant community standards.

Code for averages in runs in Eindhoven

import os
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from gmplot import gmplot
import datetime
import numpy as np
import seaborn as sns


def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees).
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    m = 6367 * c * 1000
    return m


# https://stackoverflow.com/questions/40452759/pandas-latitude-longitude-to-distance-between-successive-rows
gmap = gmplot.GoogleMapPlotter(51.5, 5.5, 13)
directory = os.fsencode('C:\\Users\\20176076\\Documents\\csvFiles')  # directory of the files

# Lists to store all distances per day of the week
monday_distance = []
tuesday_distance = []
wednesday_distance = []
thursday_distance = []
friday_distance = []
saturday_distance = []
sunday_distance = []
HRs = []

# Create an empty list for the frequency per week
Frequency = [0] * 52

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if 'results' in filename:  # Only take the generated CSV files
        # Join with the directory so the script does not depend on the working directory
        temp_df = pd.read_csv(os.path.join(os.fsdecode(directory), filename), sep=';')
        # Take the gps coordinates of the file and check if the run is in Eindhoven
        xCoor = temp_df.iloc[0, 0]
        yCoor = temp_df.iloc[0, 1]
        if 51 < xCoor < 52 and 5 < yCoor < 6:
            # Get the date of the run
            times = temp_df.iloc[0, 3].split('T')
            date = times[0]
            # Get the day of the week
            day = datetime.datetime.strptime(date, '%Y-%m-%d').strftime('%a')
            # Calculate the distance between successive points with the haversine function
            temp_df['dist'] = \
                haversine(temp_df.lon.shift(), temp_df.lat.shift(),
                          temp_df.loc[1:, 'lon'], temp_df.loc[1:, 'lat'])
            # Add all the distances between points together to get one total distance for the run
            distance = temp_df['dist'].sum()
            if day == 'Mon':
                monday_distance.append(distance)
            elif day == 'Tue':
                tuesday_distance.append(distance)
            elif day == 'Wed':
                wednesday_distance.append(distance)
            elif day == 'Thu':
                thursday_distance.append(distance)
            elif day == 'Fri':
                friday_distance.append(distance)
            elif day == 'Sat':
                saturday_distance.append(distance)
            elif day == 'Sun':
                sunday_distance.append(distance)

            # Add the run to the map
            lats = temp_df['lat']
            longs = temp_df['lon']
            gmap.plot(lats, longs, 'cornflowerblue', edge_width=1)

            # Get the list of heart rates and remove the occasional NONE value
            temp_hr = temp_df['hr'].tolist()
            wrongRow = []
            for x in range(len(temp_hr)):
                if temp_hr[x] != 'NONE':
                    wrongRow.append(float(temp_hr[x]))

            # Calculate the heart rate range (Q3 - Q1) and add it to the HRs list
            a = np.percentile(wrongRow, 25)
            b = np.percentile(wrongRow, 75)
            HRs.append(b - a)

            # Calculate the week number and increase the count for that week
            numbers = date.split('-')
            week = datetime.date(int(numbers[0]), int(numbers[1]), int(numbers[2])).isocalendar()[1] - 1
            Frequency[week] += 1

gmap.draw("my_map.html")

all_distances = pd.DataFrame([monday_distance, tuesday_distance, wednesday_distance,
                              thursday_distance, friday_distance, saturday_distance,
                              sunday_distance]).T

sns.set_style("whitegrid")
ax = sns.boxplot(palette=["m", "g"], data=all_distances)
ax.set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                   rotation=30)
ax.set_ylabel('Distance (m)')
ax.set_title('Distance average per day')
fig = ax.get_figure()
fig.savefig("boxplots.png", facecolor="white")

# Print the median distance of each day
print(np.percentile(monday_distance, 50))
print(np.percentile(tuesday_distance, 50))
print(np.percentile(wednesday_distance, 50))
print(np.percentile(thursday_distance, 50))
print(np.percentile(friday_distance, 50))
print(np.percentile(saturday_distance, 50))
print(np.percentile(sunday_distance, 50))
print(Frequency)
print(HRs)

Code for specific runs for the visualizations

import csv
from math import radians, cos, sin, asin, sqrt
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.subplots below

sns.set_style("whitegrid")


def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees).
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    m = 6367 * c * 1000
    return m


listHR = []
listD = []
listS = []
listSpT = []
listSpeedDistance = []
distance = 0
previousLat = 0
previousLon = 0
itemCount = 0
day = ''

with open('results4.csv') as csv_file:  # Change the file to one of the four runs
    csv_reader = csv.reader(csv_file, delimiter=';')
    for row in csv_reader:
        # Row zero contains only the metadata; no data can be extracted from it
        if itemCount == 0:
            pass
        elif itemCount == 1:
            # Set the first lat and lon, and get the date from the first time frame
            previousLat = float(row[0])
            previousLon = float(row[1])
            day = row[3]
            # Take the first 10 characters to get year-month-day (including the two dashes)
            day = day[:10]
            print("date: ", day)
        else:
            # Calculate the distance from the previous data row. Every row is recorded
            # 1 s after the previous one, so distance / itemCount is the average speed in m/s
            ddistance = haversine(previousLat, previousLon, float(row[0]), float(row[1]))
            distance += ddistance
            speed = float(distance / itemCount)
            listD.append(distance)
            listS.append(speed)
            # Since hr can contain NONE values, we have to filter these out
            hr = row[5]
            if hr != 'NONE':
                listHR.append(int(hr))
            # Set the next lat and lon
            previousLat = float(row[0])
            previousLon = float(row[1])
        itemCount += 1

print(distance)


# Calculate the first and third quartile through the numpy library; this def is supplied with a list
def HRconsistancy(listHearthRates=[], *args):
    Q1 = np.percentile(listHearthRates, 25)
    Q3 = np.percentile(listHearthRates, 75)
    print(Q1, Q3, "Range: ", Q3 - Q1)  # the heart rate range is Q3 - Q1
    HR_fig, HR_ax = plt.subplots()
    sns.boxplot(data=listHearthRates, ax=HR_ax)
    HR_ax.set_ylabel('Heart rate (BPM)')
    HR_ax.set_title('Heart rate run four')
    HR_fig.savefig("Heartrate.png", facecolor="white")


HRconsistancy(listHR)


# Using the seaborn library to plot the speed against the distance with a trend line
def speedDecline():
    SP_fig, SP_ax = plt.subplots()
    sns.regplot(x=listD, y=listS, ci=80, scatter_kws={'s': 10},
                line_kws={'color': 'red'}, ax=SP_ax)
    SP_ax.set_xlabel('Distance (m)')
    SP_ax.set_ylabel('Speed (m/s)')
    SP_ax.set_title('Speed run four')
    SP_fig.savefig("speedDecline.png", facecolor="white")


speedDecline()

Code to convert GPX to CSV

# ElementTree is seen as a fast xml parser library
import xml.etree.ElementTree as ET
import re
import os.path
import csv

'''
xml structure:
<element-tag element-attributes> element.text </element.tag>

The normal gpx namespace is used (ns) together with a custom ns3 namespace.
For ns3 there is a separate def that parses this type.
ns  : {http://www.topografix.com/GPX/1/1}
ns3 : {http://www.garmin.com/xmlschemas/TrackPointExtension/v1}
'''

nsLen = len('{http://www.topografix.com/GPX/1/1}')
ns3Len = len('{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}')

# arrayPoints collects all parsed data points; the first CSV row will hold the
# metadata: ['lat', 'lon', 'ele', 'time', 'atemp', 'hr', 'cad']
arrayPoints = []

# dataPoint = (lat, lon, ele, time, atemp, hr, cad)
#             (  0,   1,   2,    3,     4,  5,   6)
dataPoint = []


def clearDataPoint():
    global dataPoint
    NONE = 'NONE'
    dataPoint = [NONE, NONE, NONE, NONE, NONE, NONE, NONE]


def showNS3(elem):
    # Parse the Garmin TrackPointExtension (ns3) elements of a track point
    elemTag = elem.tag[ns3Len:]
    global dataPoint
    global arrayPoints
    if elemTag == 'atemp':
        dataPoint[4] = elem.text
    if elemTag == 'hr':
        dataPoint[5] = elem.text
    if elemTag == 'cad':
        # cad is the last extension value, so the data point is complete here
        dataPoint[6] = elem.text
        arrayPoints.append(dataPoint)
        clearDataPoint()
    for child in elem.findall('*'):
        showNS3(child)


def showNS(elem):
    # Parse the standard GPX (ns) elements and hand extensions over to showNS3
    elemTag = elem.tag[nsLen:]
    global dataPoint
    if elemTag == 'trkseg':
        clearDataPoint()
    if elemTag == 'trkpt':
        print(str(elem.attrib['lat']), " : ", str(elem.attrib['lon']))
        dataPoint[0] = str(elem.attrib['lat'])
        dataPoint[1] = str(elem.attrib['lon'])
    if elemTag == 'ele':
        dataPoint[2] = elem.text
    if elemTag == 'time':
        dataPoint[3] = elem.text
    if elemTag == 'extensions':
        for child in elem.findall('*'):
            showNS3(child)
    else:
        for child in elem.findall('*'):
            showNS(child)


def writeCSV(csvData, namefile, gpxfilename):
    '''
    writeCSV
    par: csvData; list of lists containing the data
         namefile; name of the csv file being created
         gpxfilename; name of the source gpx file (used for error reporting)
    return: none
    '''
    with open(namefile, 'w') as csvFile:
        writer = csv.writer(csvFile, delimiter=';', quotechar='|',
                            quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
        # The first row holds the metadata / column names
        writer.writerow(['lat', 'lon', 'ele', 'time', 'atemp', 'hr', 'cad'])
        for a in range(0, len(csvData) - 1):
            try:
                writer.writerow([csvData[a][0], csvData[a][1], csvData[a][2], csvData[a][3],
                                 csvData[a][4], csvData[a][5], csvData[a][6]])
            except Exception:
                print(a, ' ', gpxfilename)  # report the row that could not be written
    # the with-statement closes csvFile automatically


def parseAndGenerateCSV(root, nameCVSFile, gpxfilename):
    global arrayPoints
    showNS(root)
    print("file parsed")
    # Print any data point that does not contain all seven values
    for item in arrayPoints:
        if len(item) != 7:
            print(item)
    writeCSV(arrayPoints, nameCVSFile, gpxfilename)  # write the CSV file
    print("csv created ", nameCVSFile, " ", gpxfilename)


def iterateAllFiles():
    script_dir = os.path.dirname(__file__)
    absDir = os.path.join(script_dir, 'filesGPXXX')
    directory = os.fsencode(absDir)
    count = 0
    print(absDir)
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        if filename.endswith(".gpx"):
            fileN = os.fsdecode(directory)
            source = open(os.path.join(fileN, filename), 'r')
            # Create an ElementTree from the file at location source
            tree = ET.parse(source)
            root = tree.getroot()
            nameCVSFile = 'results' + str(count) + '.csv'
            arrayPoints.clear()
            clearDataPoint()
            parseAndGenerateCSV(root, nameCVSFile, file)
            print(os.path.join(fileN, filename))
            count = count + 1
        else:
            continue


iterateAllFiles()