Unemployment (splitting lines) Transcript

Start visual description. The instructor’s screen is shared where he shows how to prepare and write code for a particular program. He demonstrates the steps as he describes each aspect. The instructor can be seen in the top right-hand corner in a small box. End visual description.

[00:00:00] Instructor: Hey! Let’s walk through how to do the unemployment activity. This is going to be a reading, writing files pattern but now with the extra detail of how to use split in order to get information out of the file.

[00:00:17] So here we’ve got this unemployment dot txt file, looks something like this. This is what we call fixed column width format, or fixed width format. Regardless, the idea here is that I’ve got spaces that separate the values of each column, and since split splits on whitespace, it works great at splitting this information apart and turning it into a list for us that we can use.
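As a minimal sketch of what split does to one line of such a file: the sample line below is invented for illustration (the transcript does not show the file's actual contents), and only the four-column layout is taken from the video.

```python
# Hypothetical line: the column values are made up for illustration.
line = "SERIES01   1999   M01   4.3\n"

# With no arguments, split() breaks on any run of whitespace and
# drops leading and trailing whitespace, including the newline.
parts = line.split()

print(parts)  # ['SERIES01', '1999', 'M01', '4.3']
```

Every item in the resulting list is a string, which matters later when the unemployment value has to be treated as a number.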

[00:00:45] And so we want to write a program that takes a file that has the same format and a year from the command line and it’s going to print the average unemployment observed for that year.

[00:01:00] So before we go further, let’s start to give our program some structure and get away from an empty file and start to think about what we have. So, come here, unemployment, it’s empty, right? We know we’re going to have a main, it’s going to do something, and we know that if name is equal to main, we’re going to call main.
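The starting structure described here can be sketched as follows. The body of main is a placeholder, not what the instructor typed.

```python
def main():
    # Placeholder body: the real work gets filled in step by step.
    print("TODO: compute the average unemployment")


# This check is true only when the file is run directly
# (for example, with "python unemployment.py"), not when it is imported.
if __name__ == "__main__":
    main()
```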

[00:01:26] From there, let’s come back and let’s look at what information does our program need from the user, from us, in order to do its job? This program takes a file like unemployment dot txt, that’s one thing we’re going to need, and it takes a year. So those things I put up in here, right? So we’re going to say input file and year. I have two things coming in. Using a little doc string can help. Now your PyCharm may include these extra details. Or it may not, this is just a convention for how to format things, but I could describe them here since I’ve got it.

[00:02:11] What is my input file? And if this is an auto pop-up for you, just do something like this. Input input file and describe it. So what is my input file? A file with fixed width columns. I could say, you know, an example line looks something like this.

[00:02:35] I could even put the header on here, which is the first line. That helps me know what everything means. Or I could even do something like this. I can get fancy, say, example, right? And we’ll just put some spaces in there, right? This is about what it looks like. So seeing this data in my code can help me keep track of what I’m trying to do. Coming over here, year: a string, or it could be like this, string, the year to select.

[00:03:14] And then I want to return nothing. Nothing. But I do want to print the average unemployment for that year for the specified year. So that helps me see the intent of my program. I’m going to take a file that looks like this. Here’s an example of it and I’m going to take a year that’s in string format and I’m not going to return anything, but I’m going to print the average for the year.
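A docstring along these lines might look like the sketch below. The reStructuredText-style fields are an assumption about what PyCharm generates, and the example line is invented, since the transcript does not show the real data.

```python
def main(input_file, year):
    """
    Print the average unemployment for the specified year.

    :param input_file: a file with fixed-width columns, for example
        (hypothetical values)
            SERIES01   1999   M01   4.3
    :param year: str, the year to select
    :return: nothing (the average is printed, not returned)
    """
```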

[00:03:52] Now, before I continue here, I can see that PyCharm’s warning me, there’s a little yellow here, I haven’t passed an input file and a year. Where are those going to come from? Let’s pass those in from the command line, right? So I can use sys dot argv one for our file and sys dot argv two for a year. And it’s going to tell me that I haven’t imported sys yet, as an unresolved reference, but I can click here on this blue link, import sys. That’ll add it for me.
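A sketch of wiring the command-line arguments into main is shown below. The argument-count check and the usage message are additions for safety and are not part of what the video shows; the body of main is a placeholder.

```python
import sys


def main(input_file, year):
    # Placeholder body so the sketch runs on its own.
    print(f"Would read {input_file} and select the year {year}")


if __name__ == "__main__":
    # sys.argv[0] is the script name; the user's arguments start at index 1.
    if len(sys.argv) != 3:
        # Not in the video: a guard against missing arguments.
        print("usage: python unemployment.py <input_file> <year>")
    else:
        main(sys.argv[1], sys.argv[2])
```

Values in sys.argv are always strings, which is why the year is described as a string in the docstring.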

[00:04:23] Great. So now let’s dive in just a little bit more. Now we can see what we’re trying to do. But I guess there’s a little bit more I can even include here because this is going to be, you know, I need to read my lines and then I need to do some list patterns to them. I’m not going to write those lines, so it’s just a two-step process. But it’s the same overall pattern, right? Read the lines, do something to the lines.

[00:04:52] So I’m going to go back to count BYU from the previous video and I’m just going to grab read lines, paste it up here because I know it works. And now down here, I can say lines equals read lines in the file. Great.
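The read lines function is copied from an earlier video and is not shown in this transcript. One common way to write such a helper, offered only as an assumption about what it looks like, is:

```python
def read_lines(filename):
    # Hypothetical version of the helper reused from the previous video.
    # Returns a list of strings, one per line, each still ending in "\n".
    with open(filename) as file:
        return file.readlines()
```

Inside main, the call would then be something like `lines = read_lines(input_file)`.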

[00:05:10] And then I need to process them somehow, right? I could say average unemployment equals get average unemployment given some lines and then I can print it. Print will make it look nice. The average unemployment in the year was average unemployment. Sounds nice. So get average unemployment, we just need to write that part now.

[00:05:47] And specifically, we want the average unemployment for a specific year which we passed in up here. So let’s give that information to average unemployment as well and say, hey, go create this function for us. All right, get average unemployment given all my lines in a given year.

[00:06:07] So what’s the process here? Well, this is just a few different list patterns, right? I have lots of data for lots of years, but I want a specific year. That’s a filter pattern. Then given just that year, I want to add up, or average, a certain piece of information. Well, that’s just an accumulation pattern, right? So why don’t we break this up to get lines for the specified year? And then I can say, get average unemployment from those lines.

[00:06:53] You could think of different ways of splitting this up, but here’s two simple patterns that we already know. And if I can just do those two one after the other, problem solved, right? So I can say, you know, year lines equals filter year, given lines and year. And that will give me lines just for that year. And then I can say get average unemployment from those lines.

[00:07:31] Now, I see there’s a problem here. We used this name, right? I have two functions that both do about the same thing. And so maybe here I’m going to say for year get average unemployment for a specific year, which is really what we wanted to do. And then here, get average unemployment, you’re just going to get for whatever lines we provide it.

[00:07:57] Filter year. Well, that’s just a filter pattern, right? Keepers equals empty list. For line in lines, if should keep line and year, keepers dot append line. Return keepers, right? How do I know I should keep a line based off of a year? Well, that’s a good question. What do we have? Let’s look at our data again.

[00:08:36] So I’ve got these four columns and I’m interested in the information that’s in the second column, that the second column matches. What are the indexes? It’s going to be index zero, one, two, and three. So if index one matches my year, that’s what I want to keep.

[00:08:56] So I can say parts equals line dot split and then I can say return parts one is equal to year. So if that second column position one of the split line matches, then we’ll keep that line. And now we can filter them out. We could pause right here if we wanted to. We could put a breakpoint in, come down here, comment this out and say main and now we would just need unemployment dot txt as an input and a year, 1999.
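The filter pattern described above can be sketched as follows. The function names follow the spoken ones (filter year, should keep) but their exact spelling is an assumption, and the sample lines are invented.

```python
def should_keep(line, year):
    # The year sits in the second column, which is index 1 after splitting.
    parts = line.split()
    return parts[1] == year


def filter_year(lines, year):
    # Filter pattern: collect only the lines that pass the test.
    keepers = []
    for line in lines:
        if should_keep(line, year):
            keepers.append(line)
    return keepers


# Hypothetical sample data: values are made up for illustration.
sample = [
    "SERIES01 1998 M12 4.4\n",
    "SERIES01 1999 M01 4.3\n",
    "SERIES01 1999 M02 4.4\n",
]

print(filter_year(sample, "1999"))
```

The comparison is string to string, so the year must stay a string here. This sketch assumes every line has at least two columns; a blank line would raise an IndexError.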

[00:09:40] And I can debug this file and let’s see if our filter has worked so far. So here, we’ve paused. We have year lines and so a filter year has worked appropriately, everything in year lines should only have 1999 in it. So I can open this up here and that looks promising, right? Looks like we got 1999 has all the lines. Great.

[00:10:11] So let’s pause that. Now we can move on. What’s the average unemployment for a bunch of those? Let’s create a function for it. So this is just an aggregation pattern, average, right? So total equals zero. For line in year lines.

[00:10:35] You know, this variable name, it’s a little overspecific, right? It doesn’t have to necessarily be year lines, it could have been any kind of line. So I actually like to change the names to be only as specific as they need to be. So given a bunch of lines, let’s compute the average unemployment.

[00:10:56] So for line in lines, now I need the pieces again, right? So I could say parts equals line dot split, and I need to remind myself, where is the unemployment? That’s in position three. Maybe up here, I would say, you know, unemployment in position three.

[00:11:19] So parts equals line dot split, unemployment is equal to parts sub-three. This is going to give me a string, but if I’m going to compute the average, I need it as a number and these have decimal places. So I’m going to make that a float and now I can say total plus equals unemployment and then at the very end return total divided by the number of lines that we’ve processed and there’s our average unemployment.
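A sketch of the averaging step, using invented sample lines with the unemployment value in position three:

```python
def get_average_unemployment(lines):
    # Accumulation pattern: add up the values, then divide by the count.
    total = 0
    for line in lines:
        parts = line.split()
        # parts[3] is a string such as "4.3"; float() makes it a number.
        unemployment = float(parts[3])
        total += unemployment
    return total / len(lines)


# Hypothetical sample data: values are made up for illustration.
sample = [
    "SERIES01 1999 M01 4.0\n",
    "SERIES01 1999 M02 5.0\n",
]

print(get_average_unemployment(sample))  # 4.5
```

If no lines match the requested year, the list is empty and the division raises a ZeroDivisionError. The video does not handle that case, and neither does this sketch.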

[00:11:56] So now, why is this whining at me? Let’s look at this. This doesn’t return anything. Oh, that’s interesting. Let’s come follow this function. And we can see here. Oh, we’re not returning anything. Well, we need to. We want to return whatever this returns. So we’ll say that, right? Return the result of this computation, which is now what average unemployment will be, and it should print it out.
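The warning comes from a function that computes a value but never returns it. A small illustration, with hypothetical function names that are not from the video:

```python
def average_without_return(values):
    # The result is computed and then thrown away,
    # so the caller receives None.
    sum(values) / len(values)


def average_with_return(values):
    # Adding "return" hands the result back to the caller.
    return sum(values) / len(values)


print(average_without_return([4.0, 5.0]))  # None
print(average_with_return([4.0, 5.0]))     # 4.5
```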

[00:12:21] Cool. Let’s give it a try. So open up our terminal, Python unemployment dot py. Let’s give it the unemployment dot txt and the year. I hear somebody saying, do 1989. So we’ll run that. And here it says the average unemployment in the year 1999 was 4.19 blah, blah, blah, blah.

[00:12:49] So why did it do 1999? Because we forgot to switch it back, right? So we’re going to come over here, comment that out, comment this back in. I guarantee if you debug this way, you’ll forget once or twice. Just know it’s up there. Also, this is silly to have it that big. So I’m going to use a round right here to one decimal place just to simplify that.
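Rounding to one decimal place before printing can be sketched like this. The long value below is a stand-in, since the transcript only gives the first digits of the real result, and the f-string is one possible way to build the message.

```python
# Hypothetical unrounded average, standing in for the real result.
average_unemployment = 4.191666
year = "1999"

# round(value, 1) keeps one digit after the decimal point.
message = (
    f"The average unemployment in the year {year} "
    f"was {round(average_unemployment, 1)}"
)
print(message)  # The average unemployment in the year 1999 was 4.2
```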

[00:13:13] All right, let’s try this again. 1989. It was 5.2. 1999, right? We can check that. We could do 2010 and compute that. So now we’ve written this function, this program, we can pass a different input in to get new information out. That’s kind of cool.

[00:13:32] I love data science. And this is the very basic beginnings of that idea of writing programs that can go out and read information and process it for us and give us the part that we care about. So with that, happy data hunting and good luck writing unemployment dot py.