Unemployment (splitting lines) Transcript
Start visual description. The instructor’s screen is shared where he shows how to prepare and write code for a particular program. He demonstrates the steps as he describes each aspect. The instructor can be seen in the top right-hand corner in a small box. End visual description.
[00:00:00]
Instructor: Hey! Let’s walk through how to do the unemployment activity. This is
going to be a reading, writing files pattern but now with the extra detail of how
to use split in order to get information out of the file.
[00:00:17]
So here we’ve got this unemployment dot txt file, looks something like this. This
is what we call fixed column width format, or fixed width format. Regardless, the
idea here is that I’ve got spaces that separate the values of each column and
since split splits on whitespace, it works great at splitting this information apart and
turning it into a list for us that we can use.
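As a small sketch of that idea, here is what calling split with no arguments does to one line. The sample line is made up for illustration (the transcript only says the file has four space-separated columns), so the actual contents of unemployment.txt may differ.

```python
# Hypothetical line; the real file's columns and values may differ.
line = "AAA   1999   M01   4.3\n"

# With no argument, split() breaks on runs of whitespace and ignores
# leading and trailing whitespace, including the newline.
parts = line.split()
print(parts)  # ['AAA', '1999', 'M01', '4.3']
```

Note that every piece comes back as a string, including the numbers.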
[00:00:45]
And so we want to write a program that takes a file that has the same format and
a year from the command line and it’s going to print the average unemployment
observed for that year.
[00:01:00]
So before we go further, let’s start to give our program some structure and get
away from an empty file and start to think about what we have. So, come here,
unemployment, it’s empty, right? We know we’re going to have a main, it’s going
to do something and we know that if name is equal to main, we’re going to call
main.
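A minimal sketch of that starting structure might look like the following. The print inside main is only a placeholder so the skeleton does something visible when run.

```python
def main():
    # Placeholder body; the real work gets filled in later.
    print("main was called")


# Runs main() only when this file is executed directly,
# not when it is imported by another module.
if __name__ == "__main__":
    main()
```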
[00:01:26]
From there, let’s come back and let’s look at what information does our program
need from the user, from us, in order to do its job? This program takes a file like
unemployment dot txt, that’s one thing we’re going to need, and it takes a year.
So those things I put up in here, right? So we’re going to say input file and year. I
have two things coming in. Using a little docstring can help. Now your PyCharm
may include these extra details. Or it may not, this is just a convention for how to
format things, but I could describe them here since I’ve got it.
[00:02:11]
What is my input file? And if this is an auto pop-up for you, just do something like
this. Input input file and describe it. So what is my input file? A file with fixed
width columns. I could say, you know, an example line looks something like this.
[00:02:35]
I could even put the header on here, which is the first line. That helps me know
what everything means. Or I could even do something like this. I can get fancy,
say, example, right? And we’ll just put some spaces in there, right? This is about
what it looks like. So seeing this data in my code can help me keep track of what
I’m trying to do. Coming over here, a string, or it could be like this, string the year
to select.
[00:03:14]
And then I want to return nothing. Nothing. But I do want to print the average
unemployment for that year for the specified year. So that helps me see the
intent of my program. I’m going to take a file that looks like this. Here’s an
example of it and I’m going to take a year that’s in string format and I’m not
going to return anything, but I’m going to print the average for the year.
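A docstring along those lines could be sketched as below. The :param style is one common convention and is an assumption here, and the header and example line are invented placeholders, since the transcript does not show the real column names.

```python
def main(input_file, year):
    """
    Print the average unemployment for the specified year.

    Example of the file layout (hypothetical header and values):
        Series   Year   Period   Value
        AAA      1999   M01      4.3

    :param input_file: a file with fixed-width columns, like the example above
    :param year: str, the year to select
    :return: nothing
    """
    pass  # body still to be written
```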
[00:03:52]
Now, before I continue here, I can see that PyCharm’s warning me, there’s a little
yellow here, I haven’t passed an input file and a year. Where are those going to
come from? Let’s pass those in on the command line, right? So I can say I use sys dot
argv one for our file and sys dot argv two for a year. And it’s going to tell me that I
haven’t imported sys yet as an unresolved reference, but I can click here on this
blue link, import sys. That’ll add it for me.
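To illustrate where those two values sit, here is a sketch that uses a hand-built list in place of sys.argv so it runs without any command-line arguments. The helper get_arguments is hypothetical; in the script described here the values are passed straight to main as sys.argv[1] and sys.argv[2].

```python
import sys  # the real script needs this import to reach sys.argv


def get_arguments(argv):
    """Hypothetical helper: pull the file name and year out of an argv-style list."""
    # argv[0] is the script name, so the user's values start at index 1.
    return argv[1], argv[2]


# Stand-in for sys.argv after running:
#     python unemployment.py unemployment.txt 1999
fake_argv = ["unemployment.py", "unemployment.txt", "1999"]

input_file, year = get_arguments(fake_argv)
print(input_file, year)
print(type(year).__name__)  # command-line values always arrive as str
```

Because the year arrives as a string, it can be compared directly against a piece of a split line without converting either side.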
[00:04:23]
Great. So now let’s dive in just a little bit more. Now we can see what we’re
trying to do. But I guess there’s a little bit more I can even include here because
this is going to be, you know, I need to read my lines and then I need to do some
list patterns to them. I’m not going to write those lines, so it’s just a two-step
process. But it’s the same overall pattern, right? Read the lines, do something to
the lines.
[00:04:52]
So I’m going to go back to count BYU from the previous video and I’m just going
to grab read lines, paste it up here because I know it works. And now down here,
I can say lines equals read lines in the file. Great.
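The read lines helper from the previous video is not shown in this transcript, so the version below is an assumed sketch of what such a function typically looks like. It writes a small temporary file with made-up contents so the example can run on its own.

```python
import os
import tempfile


def read_lines(filename):
    """Return every line of the file as a list of strings (newlines included)."""
    with open(filename) as file:
        return file.readlines()


# Build a throwaway file with hypothetical data so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("AAA 1999 M01 4.3\nAAA 1999 M02 4.4\n")
    path = tmp.name

lines = read_lines(path)
os.remove(path)  # clean up the temporary file
print(lines)
```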
[00:05:10]
And then I need to process them somehow, right? I could say average
unemployment equals get average unemployment given some lines and then I
can print it. Print will make it look nice. The average unemployment in the year
was average unemployment. Sounds nice. So get average unemployment, we just
need to write that part now.
[00:05:47]
And specifically, we want the average unemployment for a specific year which we
passed in up here. So let’s give that information to average unemployment as
well and say, hey, go create this function for us. All right, get average
unemployment given all my lines in a given year.
[00:06:07]
So what’s the process here? Well, this is just a few different list patterns, right? I
have lots of data for lots of years, but I want a specific year. That’s a filter pattern.
Then given just that year, I want to add up, or average, a certain piece of
information. Well, that’s just an accumulation pattern, right? So why don’t we
break this up to get lines for the specified year? And then I can say, get average
unemployment from those lines.
[00:06:53]
You could think of different ways of splitting this up, but here’s two simple
patterns that we already know. And if I can just do those two one after the other,
problem solved, right? So I can say, you know, year lines equals filter year lines
and year. And that will give me lines just for that year. And then I can say get
average unemployment from those lines.
[00:07:31]
Now, I see there’s a problem here. We used this name, right? I have two
functions that both do about the same thing. And so maybe here I’m going to say
for year get average unemployment for a specific year, which is really what we
wanted to do. And then here, get average unemployment, you’re just going to
get for whatever lines we provide it.
[00:07:57]
Filter year. Well, that’s just a filter pattern, right? Keepers equals empty list. For
line in lines, if should keep line and year, keepers dot append line. Return keepers,
right? How do I know I should keep a line based off of a year? Well, that’s a good
question. What do we have? Let’s look at our data again.
[00:08:36]
So I’ve got these four columns and I’m interested in the information that’s in the
second column, that the second column matches. What are the indexes? It’s
going to be index zero, one, two, and three. So if index one matches my year,
that’s what I want to keep.
[00:08:56]
So I can say parts equals line dot split and then I can say return parts one is equal
to year. So if that second column, position one of the split line, matches, then
we’ll keep that line. And now we can filter them out. We could pause right here if
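Put together, the filter pattern described here could be sketched as follows. The function names follow the spoken names (filter year, should keep) and the sample lines are invented, so treat both as assumptions.

```python
def should_keep(line, year):
    """Keep a line when its second column (index 1) matches the year string."""
    parts = line.split()
    return parts[1] == year


def filter_year(lines, year):
    """Filter pattern: collect only the lines that pass should_keep."""
    keepers = []
    for line in lines:
        if should_keep(line, year):
            keepers.append(line)
    return keepers


# Hypothetical sample data with four whitespace-separated columns.
sample = [
    "Series Year Period Value\n",
    "AAA 1998 M12 4.4\n",
    "AAA 1999 M01 4.3\n",
    "AAA 1999 M02 4.4\n",
]

year_lines = filter_year(sample, "1999")
print(year_lines)
```

The header line drops out on its own because its second piece is not "1999". The comparison is string to string, so passing the integer 1999 would match nothing, and a blank line would raise an IndexError because it has no index 1.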
we wanted to. We could put a breakpoint in, come down here, comment this out
and say main and now we would just need unemployment dot txt as an input
and a year, 1999.
[00:09:40]
And I can debug this file and let’s see if our filter has worked so far. So here,
we’ve paused. We have year lines and so if filter year has worked appropriately,
everything in year lines should only have 1999 in it. So I can open this up here
and that looks promising, right? Looks like all the lines have 1999. Great.
[00:10:11]
So let’s pause that. Now we can move on. What’s the average unemployment for
a bunch of those? Let’s create a function for it. So this is just an aggregation
pattern, average, right? So total equals zero, for line in year lines.
[00:10:35]
You know, this variable name, it’s a little overspecific, right? It doesn’t have to
necessarily be year lines, it could have been any kind of line. So I actually like to
change the names to be only as specific as they need to be. So given a bunch of
lines, let’s compute the average unemployment.
[00:10:56]
So for line in lines, now I need the pieces again, right? So I could say parts equals
line dot split and I need to remind myself where is the unemployment? That’s in
position three. Maybe up here, I would say, you know, unemployment in position
three.
[00:11:19]
So parts equals line dot split, unemployment is equal to parts sub-three. This is
going to give me a string, but if I’m going to compute the average, I need it as a
number and these have decimal places. So I’m going to make that a float and
now I can say total plus equals unemployment and then at the very end return
total divided by the number of lines that we’ve processed and there’s our
average unemployment.
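The accumulation step just described might be sketched like this, assuming the unemployment value sits at index 3 of each split line as stated above. The sample lines are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def get_average_unemployment(lines):
    """Average the unemployment value (index 3) across the given lines."""
    total = 0
    for line in lines:
        parts = line.split()
        unemployment = float(parts[3])  # split gives strings, so convert
        total += unemployment
    return total / len(lines)


# Hypothetical lines: (4.0 + 5.0) / 2 = 4.5
sample = [
    "AAA 1999 M01 4.0",
    "AAA 1999 M02 5.0",
]

print(get_average_unemployment(sample))  # 4.5
```

One thing to keep in mind with this sketch: an empty list of lines raises ZeroDivisionError, which is what would happen if the requested year is not in the file.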
[00:11:56]
So now, why is this whining at me? Let’s look at this. This doesn’t return
anything. Oh, that’s interesting. Let’s come follow this function. And we can see
here. Oh, we’re not returning anything. Well, we need to. We want to return
whatever this returns. So we’ll say that, right? Return the result of this
computation, which is now what average unemployment will be, and it should print
it out.
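Here is a compact sketch of the two steps chained together, with the return that was missing. The name get_average_unemployment_for_year is a guess at the renamed function, and the data is made up.

```python
def filter_year(lines, year):
    """Filter pattern: keep lines whose second column matches the year."""
    keepers = []
    for line in lines:
        if line.split()[1] == year:
            keepers.append(line)
    return keepers


def get_average_unemployment(lines):
    """Accumulation pattern: average the value in the fourth column."""
    total = 0
    for line in lines:
        total += float(line.split()[3])
    return total / len(lines)


def get_average_unemployment_for_year(lines, year):
    year_lines = filter_year(lines, year)
    # Without this return, the caller would receive None.
    return get_average_unemployment(year_lines)


# Hypothetical sample data.
sample = [
    "AAA 1998 M12 6.0",
    "AAA 1999 M01 4.0",
    "AAA 1999 M02 5.0",
]

average = get_average_unemployment_for_year(sample, "1999")
print(f"The average unemployment in 1999 was {average}")
```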
[00:12:21]
Cool. Let’s give it a try. So open up our terminal, Python unemployment dot py.
Let’s give it the unemployment dot txt and the year. I hear somebody saying, do
1989. So we’ll run that. And here it says the average unemployment in the year
1999 was 4.19 blah, blah, blah, blah.
[00:12:49]
So why did it do 1999? Because we forgot to switch it back, right? So we’re going
to come over here, comment that out, comment this back in. I guarantee if you
debug this way, you’ll forget once or twice. Just know it’s up there. Also, this is
silly to have it that big. So I’m going to use a round right here to one decimal
place just to simplify that.
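For reference, rounding to one decimal place with the built-in round looks like this. The starting value is a made-up number, not the program's actual output.

```python
average = 4.19375  # hypothetical long average

rounded = round(average, 1)  # keep one digit after the decimal point
print(rounded)  # 4.2
```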
[00:13:13]
All right, let’s try this again. 1989. It was 5.2. 1999, right? We can check that. We
could do 2010 and compute that. So now we’ve written this function, this
program, we can pass a different input in to get new information out. That’s kind
of cool.
[00:13:32]
I love data science. And this is the very basic beginnings of that idea of writing
programs that can go out and read information and process it for us and give us
the part that we care about. So with that, happy data hunting and good luck
writing unemployment dot py.