Thursday, June 11, 2020

Data Analysis Iteration

WEBVTT

1
00:00:01.440 --> 00:00:05.063
This lecture's gonna provide an overview
of the cycle of data analysis.

2
00:00:10.191 --> 00:00:14.060
Data analysis is a complex process
that can involve many pieces and

3
00:00:14.060 --> 00:00:15.790
many different tools.

4
00:00:15.790 --> 00:00:18.480
But fundamentally,
there are only three parts to it.

5
00:00:19.540 --> 00:00:22.180
The first part is setting expectations.

6
00:00:22.180 --> 00:00:24.460
The second part involves
collecting information and

7
00:00:24.460 --> 00:00:26.990
comparing your expectations to data, and

8
00:00:26.990 --> 00:00:32.220
the last part involves reacting to data
and revising your expectations.

9
00:00:32.220 --> 00:00:33.750
So that's basically it.

10
00:00:33.750 --> 00:00:37.990
Those are the three parts of data analysis
that you often will cycle through,

11
00:00:37.990 --> 00:00:41.290
many times, in the course of
analyzing any given data set.

12
00:00:41.290 --> 00:00:42.820
So I'm gonna break down each one of these

13
00:00:43.870 --> 00:00:47.190
pieces to give you a little bit of
a description of what they are, and I'll

14
00:00:47.190 --> 00:00:51.690
give you a little example of kinda how
they might be applied in the real world.

15
00:00:51.690 --> 00:00:55.913
So setting expectations involves
deliberately thinking about

16
00:00:55.913 --> 00:00:58.599
what you're gonna do before you do it.

17
00:00:58.599 --> 00:01:02.585
Now, the idea in any part of data analysis
is that everything you do is gonna

18
00:01:02.585 --> 00:01:06.511
have some sort of consequence,
whether it's collecting data, whether

19
00:01:06.511 --> 00:01:11.035
it's fitting a model, whether it's asking
a question or making some sort of plot.

20
00:01:11.035 --> 00:01:15.494
Everything you do, there will be some sort
of reaction, and the point is you wanna

21
00:01:15.494 --> 00:01:19.490
think about what that consequence
is gonna be, before you do it.

22
00:01:19.490 --> 00:01:22.850
And that way you set the expectations for
yourself, and

23
00:01:22.850 --> 00:01:27.980
you can determine whether the reality
kind of meets that expectation.

24
00:01:29.635 --> 00:01:33.165
So that's the first part
of the data analysis cycle.

25
00:01:33.165 --> 00:01:37.685
Once you've set your expectations, the next
thing you wanna do is collect some data or

26
00:01:37.685 --> 00:01:42.380
collect some information that will
allow you to compare those expectations

27
00:01:42.380 --> 00:01:43.775
to reality.

28
00:01:43.775 --> 00:01:48.170
And so,
collecting that information is key,

29
00:01:48.170 --> 00:01:51.030
because it will tell you whether or
not your expectations were right.

30
00:01:51.030 --> 00:01:51.930
Whether they were wrong.

31
00:01:51.930 --> 00:01:53.713
Whether they were too high, too low,

32
00:01:53.713 --> 00:01:56.568
whatever it is depending on
the problem you're working on.

33
00:01:56.568 --> 00:02:01.009
And then, once you've collected that
information, and compared it to your

34
00:02:01.009 --> 00:02:05.541
expectations, you can react to it, and
maybe change your behavior in some way.

35
00:02:05.541 --> 00:02:09.702
So the last part of the data analysis
cycle is to think about what have

36
00:02:09.702 --> 00:02:14.309
we learned from the data, from our
expectations, and their comparison.

37
00:02:14.309 --> 00:02:16.350
What would we do differently next time?

38
00:02:17.400 --> 00:02:20.870
Did the data match our expectations,
or did they not match, and why or why not?

39
00:02:20.870 --> 00:02:22.443
So that's the third part.

40
00:02:22.443 --> 00:02:27.341
And then, once you've completed the third
part and you've revised your expectations,

41
00:02:27.341 --> 00:02:31.401
you may go back, with these revised
expectations and collect more data and

42
00:02:31.401 --> 00:02:35.267
try to match them again, and
then this iteration continues, often

43
00:02:35.267 --> 00:02:37.990
many times in
any given data analysis.

44
00:02:40.470 --> 00:02:44.830
So I just wanna give you a quick
example of how you can use these

45
00:02:44.830 --> 00:02:50.190
three components in a kind of generic or
kind of commonplace setting.

46
00:02:50.190 --> 00:02:53.360
So the basic example I'm gonna present
here is going out to dinner with

47
00:02:53.360 --> 00:02:54.510
your friends.

48
00:02:54.510 --> 00:02:56.490
So suppose you're going out to dinner and

49
00:02:56.490 --> 00:02:59.140
the restaurant you're going
to is a cash only place.

50
00:02:59.140 --> 00:03:02.410
So the question you have to ask yourself
is how much money should you bring?

51
00:03:04.005 --> 00:03:09.235
And the basic activity you're gonna
engage in is eating a meal, and you're

52
00:03:09.235 --> 00:03:13.520
gonna check for the bill, and you're gonna
have to pay money for the meal.

53
00:03:13.520 --> 00:03:17.869
But before you do that, you gotta figure
out how much money to bring, and so

54
00:03:17.869 --> 00:03:22.585
you have to figure out well, what's your
expectation for the cost of this meal.

55
00:03:22.585 --> 00:03:25.299
Maybe you've dined at this
restaurant all the time, so

56
00:03:25.299 --> 00:03:27.448
you know exactly how much it's gonna cost.

57
00:03:27.448 --> 00:03:32.487
Maybe you know, well in this city, the
typical meal costs this many dollars, and

58
00:03:32.487 --> 00:03:37.403
so I'll just bring that much money,
cuz this is an average kind of restaurant.

59
00:03:37.403 --> 00:03:38.260
Maybe you know,

60
00:03:38.260 --> 00:03:42.065
well the most expensive restaurant in
this city costs this many dollars.

61
00:03:42.065 --> 00:03:44.936
So I know it's not gonna
cost more than that, so

62
00:03:44.936 --> 00:03:49.180
I'll just bring that to kind of serve as
an upper bound on how much money I might

63
00:03:49.180 --> 00:03:51.223
end up spending at this restaurant.

64
00:03:51.223 --> 00:03:53.365
You might ask your friends,
if they've been there before,

65
00:03:53.365 --> 00:03:54.650
how much does this place cost?

66
00:03:54.650 --> 00:03:56.060
Or you might Google the restaurant and

67
00:03:56.060 --> 00:03:59.430
maybe look up the menu to see what
the meal typically costs there.

68
00:03:59.430 --> 00:04:04.388
At any rate, before you've gone to the
restaurant and eaten the meal, you can use

69
00:04:04.388 --> 00:04:08.310
any sort of a priori information
to set up your expectations for

70
00:04:08.310 --> 00:04:10.684
what the cost is ultimately gonna be.

71
00:04:10.684 --> 00:04:13.220
Before you observe the real thing.

72
00:04:14.430 --> 00:04:19.176
So once you've set your expectations, you
can figure out how much money to bring.

73
00:04:19.176 --> 00:04:22.339
The actual collecting of the data
involves going to the restaurant and

74
00:04:22.339 --> 00:04:23.255
getting the check.

75
00:04:23.255 --> 00:04:25.839
So once you've gotten the check,

76
00:04:25.839 --> 00:04:29.500
you've observed the reality
of what the meal costs.

77
00:04:30.640 --> 00:04:32.560
And there's two possibilities.

78
00:04:32.560 --> 00:04:35.250
One is that
the cost meets your expectation.

79
00:04:35.250 --> 00:04:38.510
So suppose you thought
it was gonna be $30, and

80
00:04:38.510 --> 00:04:40.710
it ended up being $30, then that's great.

81
00:04:40.710 --> 00:04:41.630
You know that

82
00:04:41.630 --> 00:04:45.810
you brought the right amount of money,
and then you can pay for the meal.

83
00:04:45.810 --> 00:04:49.696
The other possibility is that
the expectations don't match what

84
00:04:49.696 --> 00:04:50.657
the reality is.

85
00:04:50.657 --> 00:04:55.386
So you thought it was $30 and
it ended up being $40.

86
00:04:55.386 --> 00:04:59.120
And so, you have to ask yourself
then why do you have that mismatch?

87
00:04:59.120 --> 00:05:04.239
Why is it that you thought it was $30 and
the meal turned out to be $40?

88
00:05:04.239 --> 00:05:05.630
So there's two possibilities.

89
00:05:05.630 --> 00:05:08.000
One is that your expectations were wrong.

90
00:05:08.000 --> 00:05:11.820
So you thought that the restaurant
was cheaper than it actually was.

91
00:05:11.820 --> 00:05:15.393
Another possibility is that there's
something wrong with the data. For

92
00:05:15.393 --> 00:05:15.920
example,

93
00:05:15.920 --> 00:05:19.192
it's possible that they added up the check
wrong, or maybe they charged you for

94
00:05:19.192 --> 00:05:21.169
something
that you didn't actually eat.

95
00:05:21.169 --> 00:05:24.835
So you can look at the check to see if
there is a problem with the data that you

96
00:05:24.835 --> 00:05:25.500
collected.

97
00:05:29.040 --> 00:05:34.040
One thing to note about
this example is that it was

98
00:05:34.040 --> 00:05:39.450
easy to know whether your expectations
matched the data or not.

99
00:05:39.450 --> 00:05:43.892
So for example, if your expectation was
the meal would cost $30, and then it

100
00:05:43.892 --> 00:05:48.691
actually cost $40, you know immediately
that your expectations were not right.

101
00:05:48.691 --> 00:05:52.200
The meal was $10 more than you
actually thought it was gonna be.

102
00:05:52.200 --> 00:05:55.249
And so,
you can make that conclusion very quickly.

103
00:05:55.249 --> 00:05:57.298
Another possibility, for example,

104
00:05:57.298 --> 00:06:01.405
is that you could've said, well,
the meal will be between 0 and $1,000.

105
00:06:01.405 --> 00:06:05.885
And so, when the data actually comes
in and you see the check is $40 then it

106
00:06:05.885 --> 00:06:10.930
actually matches your expectation which
is that it's between 0 and $1,000.

107
00:06:10.930 --> 00:06:15.830
But because your original
expectation was so diffuse, and

108
00:06:15.830 --> 00:06:20.535
so kind of general,
you don't really learn that much from

109
00:06:20.535 --> 00:06:25.650
collecting the data given your
very diffuse expectation.

110
00:06:25.650 --> 00:06:30.160
So this brings us to an important point
which is that it's important to have

111
00:06:30.160 --> 00:06:32.370
a very sharp expectation or

112
00:06:32.370 --> 00:06:36.180
a sharp hypothesis about what
you're trying to investigate.

113
00:06:36.180 --> 00:06:38.940
When I said that I expected
the meal to be $30,

114
00:06:38.940 --> 00:06:44.210
it was very easy to know when
my expectations were not met.

115
00:06:44.210 --> 00:06:48.575
But if my expectation was very diffuse
and not sharp at all, like between 0 and

116
00:06:48.575 --> 00:06:52.880
$1,000, then collecting the data
doesn't really help you.

117
00:06:52.880 --> 00:06:57.430
Or it doesn't help you learn about the process
you're trying to study or in this case,

118
00:06:57.430 --> 00:06:59.360
the cost of the meal at this place.

119
00:06:59.360 --> 00:07:02.600
So ultimately,
what we're leading toward with

120
00:07:02.600 --> 00:07:06.820
setting your expectations and collecting
data is kind of a change in behavior or

121
00:07:06.820 --> 00:07:09.620
an understanding of the mechanism
you're trying to study.

122
00:07:09.620 --> 00:07:13.370
What did we learn, and
what would you do differently next time?

123
00:07:13.370 --> 00:07:16.465
So in this scenario where you
thought it was gonna be $30 and

124
00:07:16.465 --> 00:07:20.589
it ended up being $40, well then the next
time you might bring an extra $10.

125
00:07:20.589 --> 00:07:23.211
If you originally thought it
was gonna be between 0 and

126
00:07:23.211 --> 00:07:27.429
$1,000 and then the cost ended up being $40,
it's not clear that you would change

127
00:07:27.429 --> 00:07:29.992
anything about your behavior
based on this data.

128
00:07:29.992 --> 00:07:35.490
And so, if there is no change
in what you might think or

129
00:07:35.490 --> 00:07:38.950
what you might do based on
the collection of the data and

130
00:07:38.950 --> 00:07:43.140
matching it with your expectations,
then that's often a sign that

131
00:07:43.140 --> 00:07:46.740
either the evidence from your experiment
is not very strong or the data analysis

132
00:07:46.740 --> 00:07:51.420
was not able to generate enough evidence,
or there may be some other problem

133
00:07:51.420 --> 00:07:55.890
with your study or
your data analysis process.

134
00:07:55.890 --> 00:08:00.130
So setting the right expectations and
making them as sharp as possible

135
00:08:00.130 --> 00:08:03.730
is a really key element to this
whole data analysis cycle.
