Sunday, December 27, 2009

large samples.

With access to large data sets, we can run standard statistical tests with much larger sample sizes than ever before.

When this happens, it becomes really easy to have a very low p-value. Forget 0.05, we're talking 0.00001.

A more interesting question becomes: Is the difference large enough to be interesting?
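Here's a rough sketch of the effect. The numbers are made up for illustration: two groups whose means differ by half a point on a scale with a standard deviation of 15, compared with a plain two-sided z-test (the function name and the inputs are my own assumptions, not from any real data set). The difference never changes; only the sample size does.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n):
    """Two-sided z-test for two equal-sized groups sharing a known sd.

    Returns (z, p_value, effect_size), where effect_size is the
    difference in means expressed in standard deviations.
    """
    se = sd * math.sqrt(2.0 / n)           # standard error of the difference
    z = (mean_a - mean_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    effect = (mean_a - mean_b) / sd        # does not depend on n
    return z, p, effect

# Hypothetical inputs: same tiny difference, three sample sizes.
for n in (100, 10_000, 1_000_000):
    z, p, effect = two_sample_z(100.5, 100.0, 15.0, n)
    print(f"n={n:>9,}  z={z:6.2f}  p={p:.2g}  effect={effect:.3f}")
```

The p-value collapses as n grows, while the effect size stays at about a thirtieth of a standard deviation. Whether that is large enough to be interesting is a separate question from whether it is significant.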

Monday, November 23, 2009

The Job Hunt.


Something obvious occurred to me. The point of any job hunt is to find a new local maximum.

Here's the claim: With each job and payrate, there's a certain set-point of happiness that the job returns to. Get a pay raise, and that set point goes up -- but less than we'd like to think.

But, get a job doing something you find more important, or just more rewarding, and the set point goes up a lot.

And here's when you should change jobs: When the set point for your current job is considerably lower than where you want to be.

I've got a chart for that. This chart shows the original happiness at a job, and what happens when reality hits. A pay raise goes along with an increase in happiness, as do additional responsibilities.

Here's the problem: At least in this example, the happiness increases in the position have failed to keep up with the desired satisfaction.

There are two ways out of this: change a job or lower expectations.

Which do you think should happen?

Wednesday, September 2, 2009

Teaching Night School.

It is official: If I finish the paperwork, I'll be teaching night classes in October!

While they are going to pay me for this, I consider this to be largely volunteer work. It is for the Arlington Public Schools Adult Education, and it is largely open to anyone who can pay. That price is:
$109 Arlington Residents
$85 Arlington Seniors
$145 non-Arlington Resident
$109 Non-Resident Senior

The course is entitled Excel: Beyond The Basics. According to the course description I've been handed, it will cover:

--Named Ranges.
--Conditional formatting
--Logic Functions to test data
--Comments
--External Data

This needs to happen in three 3-hour sessions.

I am almost certainly going to wind up tailoring from the podium. With a 10 minute break each session, 10 minutes to warm up and cool off each time, and a half hour of introductions in the first session, I've got 7 hours. That's about 84 minutes per topic.
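The time budget works out like this. This is a quick sketch that assumes "warm up and cool off" means 10 minutes for each, which is the reading that gives 7 hours; the variable names are mine.

```python
SESSIONS = 3
SESSION_MINUTES = 3 * 60
BREAK = 10            # one break per session
WARM_UP = 10          # assumed: 10 minutes to warm up...
COOL_OFF = 10         # ...and another 10 to cool off
INTRODUCTIONS = 30    # first session only
TOPICS = 5            # the five items in the course description

overhead = SESSIONS * (BREAK + WARM_UP + COOL_OFF) + INTRODUCTIONS
teaching_minutes = SESSIONS * SESSION_MINUTES - overhead
minutes_per_topic = teaching_minutes / TOPICS

print(teaching_minutes / 60, "hours,", minutes_per_topic, "minutes per topic")
```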

Here's the plan:

---- Named Ranges ----
If I'm doing named ranges in 80 minutes, I will probably cover 2 ways to create them, explain the Named Range edit box, and show them used in a few formulas. Then give a 20 minute exercise where they create a named range and use it in a formula. We'll then do a second formula. Then we update the named range, and show how it cascades through each formula.


---- Conditional Formatting ----
As for conditional formatting, I'll start with "wouldn't it be nice if negative numbers could easily come up in red?" -- and yes, I know this is doable through other means. But, if we allow the zero to be arbitrary, it becomes much more useful.

For instance, when measuring out my monthly spending, I want Excel to tell me in RED if my checking balance will ever be below a thousand bucks.

---- Logical Functions To Test Data ----
Here's the real trick: We just did this in "Conditional Formatting". This'll be a lot of 'if my outgoing money is bigger than my income revenue, what happens?'

I'm liable to find or create a decent data set and do some fun work with "and" and "or". I'll avoid NAND.

---- Comments ----
The most useful thing comments can do for us is the same thing they do in code: tell a future version of yourself, or another programmer, what the hell you were talking about. I'll discuss this. Then I'll pull up a spreadsheet without any comments and ask them what it means. Then I'll pull up a decently commented version, and it'll be a lot easier to figure out.

---- External Data ----
We may do a web query, at least if there is active internet access. Otherwise, I'll bring in a comma-delimited file and we can figure it out.

---- Pivot Tables ----
If we have any time left, I'm liable to go back to the well and discuss pivot tables. We'll talk about the utility of pivoting data, and why it matters. I'll use a data set -- possibly a credit card statement -- with a few hundred rows. Then, we pivot the data and some things become almost immediately obvious.

Speed of thought visual data analysis. Awesome.

So, that's my plan to teach Excel to a population that shows up and wants to improve themselves.

This is a plan in progress, and I expect it to be modified as time goes on. I also expect I will wind up doing some tailoring from the podium to bring the course into line with the expectations of the students. I hope they feel comfortable telling me if they have stopped following, or if I am going too slow. To ensure that, I'll need to establish trust in the beginning.

My guess is credibility will be easy to establish with this crowd, but trust a lot harder than with the government analysts I've been training.

I think I'll do that by:
1) Not stressing the fancy degrees, but mentioning them.
2) Making intentional mistakes.
3) Asking everyone what their expectations are, and constant checkins on how they feel.
4) Dressing less fancy than when I deliver for the company. I'm thinking sandals plus business casual.

If I manage all that, I should be able to establish credibility, maintain trust and ensure attention.

Friday, July 31, 2009

Responsibilities.

I work for a training firm. What we do is assist employees, usually federal ones, in developing skills, both for the job and -- often -- skills that are useful in other ways.

Over the last year my responsibilities have grown tremendously.

In May, I taught my first course, a 3-day with 5 participants. In June, I taught a 5-day course with neither backup nor a safety net. There were 18 people in the course. In July, I created the majority of a new course. It was a major rewrite of some curriculum, and I won most of the fights I was really interested in. Two weeks ago, my boss put me on call while he was teaching a class in case he needed to attend to family matters.

In August, I'm teaching 2 courses. There is one work day between the two. This will be a total of four classes taught this summer.

Last summer, working for the same company, I wrote some exercises and took a course from another division of our company.

This is a fairly sizable difference in responsibilities. I've become a trusted member of the workforce, and am directly interacting with our clients and representing the company.

If that's what a year has done, I wonder what another year will do?

Saturday, July 18, 2009

On the nature of clothes not fitting.

I have difficulty buying suits. Apparently the ratio of my waist to my shoulders is a little off from the "standard". This results in a few problems that invariably cost me additional money.

My understanding is women have it a lot worse, and that it can be a challenge for many women to find clothes in stores at all.

I don't know how this is possible. Imagine going to a car dealer and not being able to fit into a car. Or to Chipotle and being unable to eat rice.

There are a few potential solutions. One is sites like etsy, or other ways of having clothes custom made. This need not be expensive, but it isn't a system-wide solution.

Instead, I'd like to recommend a 2-sigma solution. Imagine a starting clothing company that has decided that 95% of the population will be able to find clothes off the rack. Assuming that the shape of the population is roughly normal, how would this be possible?

A few steps:
1. Take a sample of the population. Use about 385 individuals per sample, and try to reduce bias in the sample.
2. Measure everything about these folks. Inseam, waist, distance from waist to armpit, length of arm. Get the ratios. Figure everything out you need to make them clothes.
3. Repeat steps (1) and (2) a few times. Graph the points, and look for differences. From these multiple samples, create a 95% confidence interval.
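
The post doesn't say where 385 comes from. One common source is the textbook sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2, at 95% confidence with a 5% margin of error and the worst-case p = 0.5. Here is a minimal sketch under that assumption (the function name and defaults are mine):

```python
import math
from statistics import NormalDist

def sample_size(confidence=0.95, margin=0.05, p=0.5):
    """Smallest n that estimates a proportion to within +/- margin.

    Assumes a simple random sample from a large population;
    p = 0.5 is the most conservative choice.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return math.ceil(z * z * p * (1 - p) / margin ** 2)

print(sample_size())   # 385
```

Tightening the margin gets expensive quickly: the required sample grows with the inverse square of the margin.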

Now you've measured a small sample of the population, and have a pretty decent interval measure. For optimal results, take measurements each season. Then not only do you know what clothes are needed, but you also know the changing shape of the population.

My guess is that I'm well within 2 sigma, and that many clothing stores simply do not cater to that wide of margins of the population.

The first company to do so should make a killing. The second to do so is liable to kill the first.

Thursday, June 11, 2009

Being Appreciated.

I've now been in the workforce for a year.

I've been taking on more responsibility.

Official as of mid-June, I'm making a significant amount more than I was hired at.

Still less than my loans, but a nice raise and -- technically -- a promotion.

I'm going to be doing the exact same job, but my title is going from "Analyst" to "Senior Analyst."

My salary is increasing by a little bit more than 5% over the company-wide 3% we had earlier this year.

It's a little more than 5% as, instead of increasing by a percentage, the raise goes to a round number that is close to the percentage.

It is good to be appreciated for my successes.

Next week: I'm delivering a 5-day, 40-hour behemoth of a course that we just rewrote.

Monday, May 18, 2009

The Economics of Moving.

I've been moving in the past month. This should do a large part to explain the lack of posting. Moving furniture is time consuming and expensive.

Due to this, I've been thinking about the payback period on the amount I've expended to move. While there are a lot of quality of life differences (such as being a lot closer to Dianne), that only comes off in derived statistics.

Let's see what we can tell.

Previous expenses:
Rent: 1200
Groceries: The occasional trip to the store. If alone, it would be crap. If with Dianne, we'd go find stuff 3 or so times a week. My average monthly grocery store bill for the last year is $230. The above chart shows this spending pattern, if it can be called a pattern.
Electricity: Included in rent.
Internet: Municipal wifi that was unreliable.
Netflix: We used Dianne's, and I mooched. Free.
Total Monthly Expenses: Call it 1430.

Now:
Rent: 800
Groceries: We're getting $50 of groceries delivered each week, of which I'm paying half. I expect this to reduce my grocery store bill by at least half. Call it $100 plus the $100 of delivered groceries. $200 / month.
Internet: We're splitting FiOS, which should run around $30 each.
Netflix: We're upping to 3, and I'm going to pay half. Call it $10 a month.
Electricity: ~$25/month
ZipCar: $15
Total Expenses: $1080 / month

That is, each month I expect to save nearly $350 over how much I was spending.
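
For anyone checking my arithmetic, here are the two budgets as a quick sketch. Items listed above as free or included in rent are left out, and the dictionary keys are just labels I picked.

```python
# Monthly figures in dollars, taken from the two lists above.
before = {"rent": 1200, "groceries": 230}
after = {
    "rent": 800,
    "groceries": 200,
    "internet": 30,
    "netflix": 10,
    "electricity": 25,
    "zipcar": 15,
}

total_before = sum(before.values())
total_after = sum(after.values())
monthly_saving = total_before - total_after

print(total_before, total_after, monthly_saving)   # 1430 1080 350
```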

However, merely moving is pretty expensive. We both paid an extra month's rent so that we didn't need to do it over a weekend. There's also the expense of needing to eat out more often, but my grocery store spending is so varied that I don't feel right making a guess as to how much I didn't spend there.

Even ignoring that, I've spent an extra $1200 this month for double rent. If that's everything, then the question becomes:

How long will it take me to repay the $1200 I spent moving, if I'm saving $350 each month?

The answer to this question is pretty obvious -- a little under 4 months -- and tells me that the move is cost effective, if we look at any considerable length of time. Certainly over the course of a year it will be incredibly effective.

My next question is: When do I reach my savings goals?
First goal: One month's set expenses. The amount needed went down. I now have, in checking, enough for everything that comes through my checking account. Win.
Second Goal: 3 months' expenses. This is a little harder. My spending over the last year has been 20546.58. However, this includes some outliers -- like purchasing all my furniture in one month. The giant peak on the chart is the Ikea trip.

If we ignore the months where I bought furniture, the average is around $1500. That includes the grocery store and other expenses that should be going down, so let's call it $1000.

To have 3 months' expenses, I need about $6000.

When I started this job, I had $6000 and could last 12 months on it. That's lifestyle inflation.

So, this question reduces to the fairly simple: If I'm saving 350 dollars a month and already have a thousand dollars, how long will it take me to get to a total of $6000?

The answer is, of course, 14 months and about a week.
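
Both questions in this post are the same division: an amount to cover, over a fixed monthly saving. A small sketch using the figures above (the function name is mine, and it ignores interest and any change in spending):

```python
def months_to_cover(amount, monthly_saving):
    """Months of saving needed to cover an amount, at a constant rate."""
    return amount / monthly_saving

payback = months_to_cover(1200, 350)          # the double rent
goal = months_to_cover(6000 - 1000, 350)      # from $1000 saved to $6000

print(f"payback: {payback:.2f} months")
print(f"savings goal: {goal:.2f} months")
```

The leftover fraction on the savings goal is a bit over a quarter of a month, which is where "about a week" comes from.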

That's not so bad, but here's hoping I get a raise.

Friday, March 27, 2009

On High-Level Data.

Every few weeks, someone will send me some data.

Most of the time, it winds up being high-level summary data. They'll give me the number of apples sold in a month, but won't send me the log of apples being sold.

What I want is to be given a list of everytime an apple was sold, its weight, the store it was at, who was on the register, and everything.

I'm told this is "too much data". Excel can only do about 65536 rows at any one time (yeah, I've got 2003), but I can edit it down using other desktop tools. Namely, I'd use Access to query it down to what I'm interested in.

Yet, the various Analysis departments I work with (with the notable exception of the Department of Transportation) don't seem to see the utility in having more than the high-level data.

I'm not sure how to make folks become less afraid of a full data sheet. But it needs to happen, so that folks with the know-how will have the tools that they need.

Thursday, March 19, 2009

Updates.

I'm accepting -- and demanding -- more responsibility at work.

I work for a training firm in the DC area. The group I work with does analytics training. This involves a lot of Excel work, as well as problem solving. We train government analysts, and a lot of the job is just getting folks to not be intimidated by the white screen that is a blank Excel spreadsheet.

When I started I asked to be in the classroom. This was almost 10 months ago. It only happened this week. I could chalk up the wait time to the winter season, but I think that would only be part of the truth.

I think the training I've been going through, as well as settling into working a conventional job, have let me become much more ready than I was when I started.

I was in the classroom this week alongside my boss, and we co-facilitated the course.

Sometime this summer, I should get to do a course without assistance. It will probably be an overflow session of our single most popular course.

Tuesday, February 17, 2009

AARGGH


I'm one of many community organizers of the Alexandria-Arlington Regional Gaming Group. We get together and play games.

Everyone contributes how they choose. Most of the organizers help host events. In addition to that, one thing I do is to download the member list and perform some basic analysis on it.


This chart shows three variables. The red line and the faded one use the Y-axis to the right, while the blue line uses the one to the left. The time period begins at the point that the mainstay of the AOs had joined. The days before this graph starts were extreme outliers and are not predictive of future growth.

The taller, blue line shows total membership. We continue to have new members, but that is not the really important part of the story.

The faded line in the background is the daily membership join rate. This is in the background as it is the least important, and I want attention drawn to the other variables. As this is discrete points in time, I was hesitant to use a line graph at all. But without the lines, it looks really crowded; each of the individual points becomes very eye catching. Instead, I faded the line and got rid of the points.

I might entirely remove this line, but wanted to include it so my audience could see how variable the daily join is. This is a good contrast with the Rolling Average, which has an obvious, visual downward trend.

If I've done things correctly, the eye should be drawn to the red line. Red catches the eye, and suggests either "stop" or "bad". The red line is the average join per day, and I've graphed it on the same scale as the Daily Join rate. This has been going down since early January, and was going down before a NYE-era increase.

If the rolling average is decreasing, then we're attracting fewer new people as time goes on. It looks fairly steady since about January first, despite a surge just after then. While there are some odd patches after that, it is as good a time as any to use as a beginning point. It has the psychological factor of describing a discrete year.

To find out how strong this connection really is, we need to compare the day and the daily join rate using something like a Pearson's Correlation Coefficient or simple linear regression. This data is all in Excel, which lists dates as serial days since January 1, 1900. Instead of using that, I can convert the days into the number of days since January 1, 2009. I can then do a linear regression and get something more meaningful. I could just find the correlation, but a linear regression allows for extrapolation.
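
Outside of Excel, the date conversion looks something like the sketch below. Excel counts January 1, 1900 as day 1 and also counts a February 29, 1900 that never existed, so for any modern date the serial is the number of days since December 30, 1899. The function names are mine.

```python
from datetime import date, timedelta

# Effective day zero for Excel's 1900 date system (serials from March 1900 on).
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel date serial (whole days) to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)

def days_since(serial, start=date(2009, 1, 1)):
    """Days elapsed between `start` and the date an Excel serial stands for."""
    return (excel_serial_to_date(serial) - start).days

print(excel_serial_to_date(39814))   # 2009-01-01
print(days_since(39861))             # 47, i.e. February 17, 2009
```

Inside Excel itself, subtracting the serial for January 1, 2009 from each date cell does the same job.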

Either way, this is where Excel's Analysis ToolPak comes in handy.

When a regression is performed, I get a Multiple R of 0.82, which squares to an R^2 of about 0.67. More or less this means that about 67% of the variation in daily join rate can be accounted for based solely upon a non-zero starting point and the days since January 1.

That's pretty significant. The p-values are all very near zero (i.e., a pattern this strong would turn up very rarely if the day and the join rate were unconnected), so this is likely a significant relationship. The coefficient is -0.014, meaning each day about 0.014 fewer people join than the day before. Yes, people come in discrete packets, but for a moment that can be ignored.

If we take this moderately seriously, then it is about 130 days until membership increase drops to zero. At which point, we'll be at 263 members and it will be June 24.
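
For the curious, here is a bare-bones version of what the regression tool is doing. The numbers are made up for illustration; they are NOT the group's membership data, and the function name is mine. It fits a least-squares line, reports the correlation (Excel's "Multiple R" is its absolute value when there is one predictor), and extrapolates to the day the fitted join rate hits zero.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r), where r is Pearson's correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: days since January 1 and a rolling-average join rate.
days = [0, 10, 20, 30, 40]
join_rate = [2.5, 2.2, 2.3, 1.9, 1.9]

slope, intercept, r = fit_line(days, join_rate)
zero_day = -intercept / slope   # where the fitted line crosses zero

print(f"slope={slope:.3f}  intercept={intercept:.2f}  r={r:.2f}  r^2={r * r:.2f}")
print(f"fitted join rate reaches zero about {zero_day:.0f} days after January 1")
```

The zero crossing is only as trustworthy as the straight line itself, and it sits far outside the range of the data.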

Granted, this isn't actually very likely. Each time the group meets, new people show up. These events act like interventions on the variable of Meetup Size, which is the deterministic cause of Daily Join Rate. Supposing that nothing dramatic or terrible happens, this is likely to be a low-end estimate of the size near the end of June. This sort of extrapolation is not known to be particularly exact, especially with an R^2 so far from one.

Unfortunately, the group list that meetup.com gives its users does not include information about users who have left a group. Without that information, the only way to find out if the size is really going down is to compare different time periods.

With a few tricks in Excel and meetup's very basic member list, this sort of analysis can be generated in a few minutes. The trick is not to take it too seriously, especially extrapolation of a linear regression.

Friday, February 6, 2009

Net Present Value vs. Buffer Space.

I don't want this blog to become a personal finance blog, but there is some math to be done with those numbers.

When I got my job last June, I had several thousand dollars in savings. At the rate I was spending it, I had enough money to last until December. That was seven months away when I got the job.

Between renting an apartment, office-appropriate clothes and a few trips to Ikea, I was in credit card debt almost immediately. Different lifestyles require different amounts of money.

That's all paid off now, but the job has still not paid for itself.

By that, I mean the following several statements:
1) I do not have the same amount of cash in hand now as when I started this job.
2) The net present value of my cash now is less than the net present value of my cash in June, 2008.
3) If money stopped coming in, the amount of time I could live off the money I have is less than it was in June, 2008.

There's some simple math behind most of these. The first is a simple comparison.

To fully calculate the second one, we need to know the inflation/deflation rate over the last year. According to the consumer price index, it takes about $1.0097 in 2009 to match a 2008 dollar. This is a pretty low difference. And since there has been inflation and not deflation, merely knowing that I have fewer total dollars tells me I have less real wealth.

The third one doesn't require the CPI, but just knowledge of my own finances. I'm spending around four times what I was during unemployment. So, I would need to have four times the wealth to have the same buffer.

I don't.

The job hasn't paid for itself, and at the rate I'm going, I won't hit the first marker for another year.

Maybe I really should get a big raise.




Thursday, February 5, 2009

A Preliminary Post

This blog should serve as a repository for how I approach problems in the world. Sometimes I wonder about what's going on in my life, and finding data to predict events has been helpful. In everything from taxation to the size of my Gmail account, I've got access to information that can help in decision making. Or at least lower my stress.

With taxes, I want to know if my grad school loans count as a person. It turns out that, for the sake of a W-4, my loans are a dependent.

I feed them Wheaties.

It also turns out my Gmail account won't run out of space in the foreseeable future.

I'll be keeping track of how I approach problems, and how I attempt to solve them. This is basically what I do at work, so to avoid intellectual property concerns, I won't mention workplace projects.

Will my Gmail run out of space?

I have a strong relationship with Gmail.

I use it for all my personal emails. I used it for all my CMU email, and I would use it for work if they let me. I keep names, records, and use it as extended memory. I've got no idea when I last emailed my mom, what my girlfriend's email address actually is, my boss's email, or when I last saw the doctor.

It's all there, archived where I can find it.

Suffice it to say, I don't want to run out of space.

To find out if I will, I gathered some data. The information I have is:
1) I started this Gmail account on March 14, 2005.
2) I've used a total of 1.182 gigabytes.
3) The current space allowed is 7.292 gigabytes.
4) Gmail was started on April 1, 2004 with exactly 1 gigabyte.
5) Today is February 5, 2009.

From here, it's easy enough to give some derived values. The average amount of space I use is 830 kilobytes per day. The space allocated increases by 3.5 megabytes a day.
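
Here is the arithmetic behind those two rates, as a short sketch. It assumes decimal units (1 gigabyte = 1000 megabytes), which is what reproduces the figures above; the variable names are mine.

```python
from datetime import date

today = date(2009, 2, 5)
account_days = (today - date(2005, 3, 14)).days   # days I've had the account
gmail_days = (today - date(2004, 4, 1)).days      # days since Gmail launched

used_gb = 1.182
allowed_gb = 7.292
launch_gb = 1.0

# Simple averages over the whole period, in megabytes per day.
use_mb_per_day = used_gb * 1000 / account_days
growth_mb_per_day = (allowed_gb - launch_gb) * 1000 / gmail_days

print(f"use:    {use_mb_per_day:.2f} MB/day over {account_days} days")
print(f"growth: {growth_mb_per_day:.2f} MB/day over {gmail_days} days")
print(f"ratio:  {growth_mb_per_day / use_mb_per_day:.1f}")
```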

Both of these numbers are huge, and Adventures With Large Numbers will be another post.

It's important to remember that these are just simple averages. If I actually kept a record, I'd see irregularities and discontinuities. Memory tells me they doubled the space on April 1, 2005. But, that's not the data I've got, and I know better than to trust my memory. I don't have the daily sizes, so I cannot easily use that for analysis. Nor do I need it for a ball-park estimate.

So long as the amount raised is greater than my daily use value, I shouldn't run out of space. As it turns out, the amount they raise my storage allowance by is four times the amount I use, so I think I'm in the clear.

If I was using closer to half the average daily increase, I'd want to find the daily values and make better predictions about April First. I might even set up both as exponential relationships.

If my Daily Use Value was higher than the Average Daily Add, I'd make a prediction on when I'd run out of space. Given the current values of each, it would be easy enough to make that prediction.

Even if I was using half a megabyte more than I am getting, my Gmail wouldn't fill until 2042. This would require using roughly five times the space I currently am. And, in 2042, when it was finally full, I'd have 50 gigabytes of space.
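
That worst case can be sketched as follows, using the rounded rates from earlier in the post (3.55 megabytes a day added, 0.83 used) and the same decimal-units assumption. It treats both rates as constant, which they won't be.

```python
from datetime import date, timedelta

today = date(2009, 2, 5)
free_mb = (7.292 - 1.182) * 1000    # space not yet used, in megabytes

growth_mb_per_day = 3.55            # average daily increase in the allowance
deficit_mb_per_day = 0.5            # hypothetical: use half a megabyte more than I get
hypothetical_use = growth_mb_per_day + deficit_mb_per_day

days_to_full = free_mb / deficit_mb_per_day
full_on = today + timedelta(days=round(days_to_full))
quota_then_gb = 7.292 + growth_mb_per_day * days_to_full / 1000

print(f"full in {days_to_full:.0f} days, during {full_on.year}")
print(f"quota by then: about {quota_then_gb:.1f} GB")
print(f"use needed: {hypothetical_use / 0.83:.1f} times my current rate")
```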

All in all, if the future is sufficiently similar to the past, I won't run out of space.