WEBVTT
1
00:00:01.440 --> 00:00:05.063
This lecture's gonna provide an overview
of the cycle of data analysis.
2
00:00:10.191 --> 00:00:14.060
Data analysis is a complex process
that can involve many pieces and
3
00:00:14.060 --> 00:00:15.790
many different tools.
4
00:00:15.790 --> 00:00:18.480
But fundamentally,
there are only three parts to it.
5
00:00:19.540 --> 00:00:22.180
The first part is setting expectations.
6
00:00:22.180 --> 00:00:24.460
The second part involves
collecting information and
7
00:00:24.460 --> 00:00:26.990
comparing your expectations to data, and
8
00:00:26.990 --> 00:00:32.220
the last part involves reacting to data,
and revising your expectations.
9
00:00:32.220 --> 00:00:33.750
So that's basically it.
10
00:00:33.750 --> 00:00:37.990
Those are the three parts of data analysis
that you often will cycle through,
11
00:00:37.990 --> 00:00:41.290
many times, in the course of
analyzing any given data set.
12
00:00:41.290 --> 00:00:42.820
So I'm gonna break down each one of these
13
00:00:43.870 --> 00:00:47.190
pieces to give you a little bit of
a description of what they are, and I'll
14
00:00:47.190 --> 00:00:51.690
give you a little example of kinda how
they might be applied in the real world.
15
00:00:51.690 --> 00:00:55.913
So setting expectations involves
deliberately thinking about
16
00:00:55.913 --> 00:00:58.599
what you're gonna do before you do it.
17
00:00:58.599 --> 00:01:02.585
Now, the idea in any part of data analysis
is that everything you do is gonna
18
00:01:02.585 --> 00:01:06.511
have some sort of consequence,
whether it's collecting data, whether
19
00:01:06.511 --> 00:01:11.035
it's fitting a model, whether it's asking
a question or making some sort of plot.
20
00:01:11.035 --> 00:01:15.494
Everything you do there will be some sort
of action, and the point is you wanna
21
00:01:15.494 --> 00:01:19.490
think about what that consequence
is gonna be, before you do it.
22
00:01:19.490 --> 00:01:22.850
And that way you set the expectations for
yourself, and
23
00:01:22.850 --> 00:01:27.980
you can determine whether the reality
kind of meets that expectation.
24
00:01:29.635 --> 00:01:33.165
So that's the first part
of the data analysis cycle.
25
00:01:33.165 --> 00:01:37.685
Once you've set your expectations the next
thing you wanna do is collect some data or
26
00:01:37.685 --> 00:01:42.380
collect some information that will
allow you to compare those expectations
27
00:01:42.380 --> 00:01:43.775
to reality.
28
00:01:43.775 --> 00:01:48.170
And so,
collecting that information is key,
29
00:01:48.170 --> 00:01:51.030
because it will tell you whether or
not your expectations were right.
30
00:01:51.030 --> 00:01:51.930
Whether they were wrong.
31
00:01:51.930 --> 00:01:53.713
Whether they were too high, too low,
32
00:01:53.713 --> 00:01:56.568
whatever it is depending on
the problem you're working on.
33
00:01:56.568 --> 00:02:01.009
And then, once you've collected that
information, and compared it to your
34
00:02:01.009 --> 00:02:05.541
expectations you can react to it, and
maybe change your behavior in some way.
35
00:02:05.541 --> 00:02:09.702
So the last part of the data analysis
cycle is to think about what have
36
00:02:09.702 --> 00:02:14.309
we learned from the data, from our
expectations, and their comparison.
37
00:02:14.309 --> 00:02:16.350
What would we do differently next time?
38
00:02:17.400 --> 00:02:20.870
Did we match our expectations,
or did they not match, and why or why not?
39
00:02:20.870 --> 00:02:22.443
So that's the third part.
40
00:02:22.443 --> 00:02:27.341
And then, once you've completed the third
part and you've revised your expectations,
41
00:02:27.341 --> 00:02:31.401
you may go back, with these revised
expectations and collect more data and
42
00:02:31.401 --> 00:02:35.267
try to match them again, and
then this iteration continues, often
43
00:02:35.267 --> 00:02:37.990
many times in
any given data analysis.
44
00:02:40.470 --> 00:02:44.830
So I just wanna give you a quick
example of how you can use these
45
00:02:44.830 --> 00:02:50.190
three components in a kind of generic or
kind of commonplace setting.
46
00:02:50.190 --> 00:02:53.360
So the basic example I'm gonna present
here is going out to dinner with
47
00:02:53.360 --> 00:02:54.510
your friends.
48
00:02:54.510 --> 00:02:56.490
So suppose you're going out to dinner and
49
00:02:56.490 --> 00:02:59.140
the restaurant you're going
to is a cash only place.
50
00:02:59.140 --> 00:03:02.410
So the question you have to ask yourself
is how much money should you bring?
51
00:03:04.005 --> 00:03:09.235
And the basic activity you're gonna
engage in is eating a meal, and you're
52
00:03:09.235 --> 00:03:13.520
gonna check for the bill, and you're gonna
have to pay money for the meal.
53
00:03:13.520 --> 00:03:17.869
But before you do that, you gotta figure
out how much money to bring, and so
54
00:03:17.869 --> 00:03:22.585
you have to figure out well, what's your
expectation for the cost of this meal.
55
00:03:22.585 --> 00:03:25.299
Maybe you've dined at this
restaurant all the time, so
56
00:03:25.299 --> 00:03:27.448
you know exactly how much it's gonna cost.
57
00:03:27.448 --> 00:03:32.487
Maybe you know, well in this city, the
typical meal costs this many dollars, and
58
00:03:32.487 --> 00:03:37.403
so I'll just bring that much money,
cuz this is an average kind of restaurant.
59
00:03:37.403 --> 00:03:38.260
Maybe you know,
60
00:03:38.260 --> 00:03:42.065
well the most expensive restaurant in
this city costs this many dollars.
61
00:03:42.065 --> 00:03:44.936
So I know it's not gonna
cost more than that, so
62
00:03:44.936 --> 00:03:49.180
I'll just bring that to kind of serve as
an upper bound on how much money I might
63
00:03:49.180 --> 00:03:51.223
end up spending at this restaurant.
64
00:03:51.223 --> 00:03:53.365
You might ask your friends,
if they've been there before,
65
00:03:53.365 --> 00:03:54.650
how much does this place cost?
66
00:03:54.650 --> 00:03:56.060
Or you might Google the restaurant and
67
00:03:56.060 --> 00:03:59.430
maybe look up the menu to see what
the meal typically costs there.
68
00:03:59.430 --> 00:04:04.388
At any rate, before you've gone to the
restaurant and eaten the meal, you can use
69
00:04:04.388 --> 00:04:08.310
any sort of a priori information
to set up your expectations for
70
00:04:08.310 --> 00:04:10.684
what the cost is ultimately gonna be.
71
00:04:10.684 --> 00:04:13.220
Before you observe the real thing.
72
00:04:14.430 --> 00:04:19.176
So once you've set your expectations, you
can figure out how much money to bring.
73
00:04:19.176 --> 00:04:22.339
The actual collecting of the data
involves going to the restaurant and
74
00:04:22.339 --> 00:04:23.255
getting the check.
75
00:04:23.255 --> 00:04:25.839
So once you've gotten the check,
76
00:04:25.839 --> 00:04:29.500
you've observed the reality
of what the meal costs.
77
00:04:30.640 --> 00:04:32.560
And there's two possibilities.
78
00:04:32.560 --> 00:04:35.250
One is that,
that cost meets your expectation.
79
00:04:35.250 --> 00:04:38.510
So suppose you thought
it was gonna be $30, and
80
00:04:38.510 --> 00:04:40.710
it ended up being $30, then that's great.
81
00:04:40.710 --> 00:04:41.630
You know exactly,
82
00:04:41.630 --> 00:04:45.810
you brought the right amount of money,
and then you can pay for the meal.
83
00:04:45.810 --> 00:04:49.696
The other possibility is that
the expectations don't match what
84
00:04:49.696 --> 00:04:50.657
the reality is.
85
00:04:50.657 --> 00:04:55.386
So you thought it was $30 and
it ended up being $40.
86
00:04:55.386 --> 00:04:59.120
And so, you have to ask yourself
then why do you have that mismatch.
87
00:04:59.120 --> 00:05:04.239
Why is it that you thought it was $30 and
the meal turned out to be $40?
88
00:05:04.239 --> 00:05:05.630
So there's two possibilities.
89
00:05:05.630 --> 00:05:08.000
One is that your expectations were wrong.
90
00:05:08.000 --> 00:05:11.820
So you thought that the restaurant
was cheaper than it actually was.
91
00:05:11.820 --> 00:05:15.393
Another possibility is that there's
something wrong with the data, for
92
00:05:15.393 --> 00:05:15.920
example.
93
00:05:15.920 --> 00:05:19.192
It's possible that they added up the check
wrong, maybe they charged you for
94
00:05:19.192 --> 00:05:21.169
something
that you didn't actually eat.
95
00:05:21.169 --> 00:05:24.835
So you can look at the check to see if
there is a problem with the data that you
96
00:05:24.835 --> 00:05:25.500
collected.
97
00:05:29.040 --> 00:05:34.040
One thing to note about
this example is that it was
98
00:05:34.040 --> 00:05:39.450
easy to know whether your expectations
were matched with the data or not.
99
00:05:39.450 --> 00:05:43.892
So for example, if your expectation was
the meal would cost $30, and then it
100
00:05:43.892 --> 00:05:48.691
actually cost $40, you know immediately
that your expectations were not right.
101
00:05:48.691 --> 00:05:52.200
The meal was $10 more than you
actually thought it was gonna be.
102
00:05:52.200 --> 00:05:55.249
And so,
you can make that conclusion very quickly.
103
00:05:55.249 --> 00:05:57.298
Another possibility, for example,
104
00:05:57.298 --> 00:06:01.405
is that you could've said, well,
the meal will be between 0 and $1,000.
105
00:06:01.405 --> 00:06:05.885
And so, when the data actually comes
in and you see the check is $40 then it
106
00:06:05.885 --> 00:06:10.930
actually matches your expectation which
is that it's between 0 and $1,000.
107
00:06:10.930 --> 00:06:15.830
But because your original
expectation was so diffuse, and
108
00:06:15.830 --> 00:06:20.535
so kind of general,
you don't really learn that much from
109
00:06:20.535 --> 00:06:25.650
collecting the data given your
very diffuse expectation.
110
00:06:25.650 --> 00:06:30.160
So this brings us to an important point
which is that it's important to have
111
00:06:30.160 --> 00:06:32.370
a very sharp expectation or
112
00:06:32.370 --> 00:06:36.180
a sharp hypothesis about what
you're trying to investigate.
113
00:06:36.180 --> 00:06:38.940
When I said that I expected
the meal to be $30,
114
00:06:38.940 --> 00:06:44.210
it was very easy to know when
my expectations were not met.
115
00:06:44.210 --> 00:06:48.575
But if my expectation was very diffuse
and not sharp at all, like between 0 and
116
00:06:48.575 --> 00:06:52.880
$1,000, then collecting the data
doesn't really help you.
117
00:06:52.880 --> 00:06:57.430
Or it doesn't help you learn the process
you're trying to study or in this case,
118
00:06:57.430 --> 00:06:59.360
the cost of the meal at this place.
119
00:06:59.360 --> 00:07:02.600
So ultimately,
what we're leading toward with
120
00:07:02.600 --> 00:07:06.820
setting your expectations and collecting
data is some kind of change in behavior or
121
00:07:06.820 --> 00:07:09.620
an understanding of the mechanism
you're trying to study.
122
00:07:09.620 --> 00:07:13.370
What did we learn, and
what would you do differently next time?
123
00:07:13.370 --> 00:07:16.465
So in this scenario where you
thought it was gonna be $30 and
124
00:07:16.465 --> 00:07:20.589
it ended up being $40, well then the next
time you might bring an extra $10.
125
00:07:20.589 --> 00:07:23.211
If you originally thought it
was gonna be between 0 and
126
00:07:23.211 --> 00:07:27.429
$1,000 and then the cost ended up being $40,
it's not clear that you would change
127
00:07:27.429 --> 00:07:29.992
anything about your behavior
based on this data.
128
00:07:29.992 --> 00:07:35.490
And so, if there is no change
in what you might think or
129
00:07:35.490 --> 00:07:38.950
what you might do based on
the collection of the data and
130
00:07:38.950 --> 00:07:43.140
matching it with your expectations,
then that's often a sign that
131
00:07:43.140 --> 00:07:46.740
either the evidence from your experiment
is not very strong or the data analysis
132
00:07:46.740 --> 00:07:51.420
was not able to generate enough evidence,
or there may be some other problem
133
00:07:51.420 --> 00:07:55.890
with your study or
your data analysis process.
134
00:07:55.890 --> 00:08:00.130
So setting the right expectations and
making them as sharp as possible
135
00:08:00.130 --> 00:08:03.730
is a really key element to this
whole data analysis cycle.