Mining What's Mine: February 2009

Tuesday, February 17, 2009

AARGGH

I'm one of many community organizers of the Alexandria-Arlington Regional Gaming Group. We get together and play games.

Everyone contributes how they choose. Most of the organizers help host events. In addition to that, one thing I do is to download the member list and perform some basic analysis on it.

This chart shows three variables. The red line and the faded one use the Y-axis to the right, while the blue line uses the one to the left. The time period begins at the point that the mainstay of the AOs had joined. The days before this graph starts were extreme outliers and are not predictive of future growth.

The taller, blue line shows total membership. We continue to have new members, but that is not the really important part of the story.

The faded line in the background is the daily membership join rate. This is in the background as it is the least important, and I want attention drawn to the other variables. As these are discrete points in time, I was hesitant to use a line graph at all. But without the lines, it looks really crowded; each of the individual points becomes very eye-catching. Instead, I faded the line and got rid of the points.

I might entirely remove this line, but wanted to include it so my audience could see how variable the daily join is. This is a good contrast with the Rolling Average, which has an obvious, visual downward trend.

If I've done things correctly, the eye should be drawn to the red line. Red catches the eye, and suggests either "stop" or "bad". The red line is the average join per day, and I've graphed it on the same scale as the Daily Join rate. This has been going down since early January, and was going down before a NYE-era increase.
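For anyone who wants to reproduce the smoothing outside Excel, here is a minimal sketch of a trailing rolling average. The window length and the sample counts are assumptions made up for illustration; the post does not say what window the red line uses.

```python
from collections import deque

def rolling_average(daily_joins, window=7):
    """Trailing mean of the last `window` values (fewer at the very start)."""
    recent = deque(maxlen=window)  # old values fall off automatically
    averages = []
    for count in daily_joins:
        recent.append(count)
        averages.append(sum(recent) / len(recent))
    return averages

# Made-up daily join counts, for illustration only (not the group's data).
sample_joins = [3, 0, 5, 1, 2, 0, 4, 1, 0, 2]
smoothed = rolling_average(sample_joins, window=3)
```

The smoothed series has one value per day, so it can be plotted on the same axis as the raw daily joins.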

If the rolling average is decreasing, then we're attracting fewer new people as time goes on. It looks fairly steady since about January first, despite a surge just after then. While there are some odd patches after that, it is as good a time as any to use as a beginning point. It has the psychological factor of describing a discrete year.

To find out how strong this connection really is, we need to compare the day and the daily join rate using something like a Pearson's Correlation Coefficient or simple linear regression. This data is all in Excel, which lists dates as serial days since January 1, 1900. Instead of using that, I can convert the days into the number of days since January 1, 2009. I can then do a linear regression and get something more meaningful. I could just find the correlation, but a linear regression allows for extrapolation.
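The same two steps can be sketched outside Excel. This is only an illustration: the helper names and the sample rows are made up, and the least-squares arithmetic is the textbook formula rather than a copy of what the Excel add-in does internally.

```python
from datetime import date, timedelta

# Excel's 1900 date system: for any date after February 1900, serial N falls
# N days after December 30, 1899 (the offset absorbs Excel's phantom Feb 29, 1900).
EXCEL_EPOCH = date(1899, 12, 30)
START = date(2009, 1, 1)

def days_since_start(excel_serial):
    """Convert an Excel serial date to whole days since January 1, 2009."""
    return (EXCEL_EPOCH + timedelta(days=excel_serial) - START).days

def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Made-up (serial date, joins that day) rows; not the group's real numbers.
rows = [(39814, 4), (39815, 2), (39816, 3), (39817, 1), (39818, 2)]
days = [days_since_start(serial) for serial, _ in rows]
joins = [count for _, count in rows]
slope, intercept, r = linear_fit(days, joins)
r_squared = r ** 2  # share of the variance the fitted line accounts for
```

Shifting the origin to January 1, 2009 leaves the slope and the correlation unchanged. It only makes the intercept readable as "joins per day at the start of the year" instead of a value projected back to 1900.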

Either way, this is where Excel's Analysis ToolPak comes in handy.

When the regression is performed, I get a Multiple R of 0.82, which gives an R^2 of about 0.67. More or less, this means that about 67% of the variation in daily join rate can be accounted for solely by a non-zero starting point and the days since January 1.

That's pretty significant. The p-values are all very near zero (i.e., a trend this strong would turn up by chance in only a very small share of data sets where the two were unconnected), so this is likely a significant relationship. The coefficient is -0.014, meaning that each day about 0.014 fewer people join than the day before, or roughly one fewer person per day for every 70 days that pass. Yes, people come in discrete packets, but for a moment that can be ignored.

If we take this moderately seriously, then it is about 130 days until the daily membership increase drops to zero. At that point, it will be June 24 and we'll be at 263 members.
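The arithmetic behind that kind of estimate is short. If the fitted join rate falls in a straight line, it reaches zero after (rate today) / (daily drop) days, and the members gained on the way down are the area of the triangle under the line. In the sketch below, only the slope comes from the regression above. The fitted rate for today is not given in the post, so the value used is back-calculated from the 130-day figure and should be read as an assumption.

```python
def extrapolate_to_zero(rate_today, slope_per_day):
    """Days until a linearly falling join rate hits zero, and the members
    added on the way down (area of the triangle under the fitted line)."""
    if slope_per_day >= 0:
        raise ValueError("the rate only reaches zero if the slope is negative")
    days_left = rate_today / -slope_per_day
    members_added = 0.5 * rate_today * days_left
    return days_left, members_added

SLOPE = -0.014     # regression coefficient reported in the post
RATE_TODAY = 1.82  # assumption: chosen so the zero point lands about 130 days out

days_left, members_added = extrapolate_to_zero(RATE_TODAY, SLOPE)
```

Small changes in the slope move the zero point by weeks, which is one more reason not to lean on the exact date.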

Granted, this isn't actually very likely. Each time the group meets, new people show up. These events act like an intervention on the variable of Meetup Size, which is the deterministic cause of Daily Join Rate. Supposing that nothing dramatic or terrible happens, this is likely to be a low-end estimate of the size near the end of June. This sort of extrapolation is not known to be particularly exact, especially with an R^2 so far from one.

Unfortunately, the group list that meetup.com gives its users does not include information about users who have left a group. Without that information, the only way to find out if the size is really going down is to compare different time periods.

With a few tricks in Excel and meetup's very basic member list, this sort of analysis can be generated in a few minutes. The trick is not to take it too seriously, especially extrapolation of a linear regression.

Friday, February 6, 2009

Net Present Value vs. Buffer Space.

I don't want this blog to become a personal finance blog, but there is some math to be done with those numbers.

When I got my job last June, I had several thousand dollars in savings. At the rate I was spending it, I had enough money to last until December. That was seven months away when I got the job.

Between renting an apartment, office-appropriate clothes and a few trips to Ikea, I was in credit card debt almost immediately. Different lifestyles require different amounts of money.

That's all paid off now, but the job has still not paid for itself.

By that, I mean the following several statements:
1) I do not have the same amount of cash in hand now as when I started this job.
2) The net present value of my cash now is less than the net present value of my cash in June, 2008.
3) If money stopped coming in, the amount of time I could live off the money I have is less than it was in June 2008.

There's some simple math behind most of these. The first is a simple comparison.

To fully calculate the second one, we need to know the inflation/deflation rate over the last year. According to the consumer price index, one 2008 dollar buys what about 1.0097 dollars buy in 2009. This is a pretty small difference. And since there has been inflation and not deflation, merely knowing that I have fewer total dollars tells me I have less real wealth.

The third one doesn't require the CPI, but just knowledge of my own finances. I'm spending around four times what I was during unemployment. So, I would need to have four times the wealth to have the same buffer.

I don't.
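The buffer comparison is just cash divided by spending. Here is a small sketch with a made-up monthly figure (I'm not publishing my real one); the seven months and the factor of four are the only numbers taken from above.

```python
def runway_months(cash, monthly_spending):
    """How many months the cash lasts if no more money comes in."""
    return cash / monthly_spending

old_spending = 1000.0             # hypothetical monthly spending while unemployed
old_cash = 7 * old_spending       # enough for the seven months mentioned above
new_spending = 4 * old_spending   # spending is now about four times higher

same_cash_runway = runway_months(old_cash, new_spending)  # 1.75 months
cash_needed = 7 * new_spending    # cash required to get the same seven months back
```

Whatever the real monthly figure is, it cancels out: the same cash now covers a quarter of the time.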

The job hasn't paid for itself, and at the rate I'm going, I won't hit the first marker for another year.

Maybe I really should get a big raise.

Thursday, February 5, 2009

A Preliminary Post

This blog should serve as a repository for how I approach problems in the world. Sometimes I wonder about what's going on in my life, and finding data to predict events has been helpful. In everything from taxation to the size of my Gmail account, I've got access to information that can help in decision making. Or at least lower my stress.

With taxes, I want to know if my grad school loans count as a person. It turns out that, for the sake of a W-4, my loans are a dependent.

I feed them Wheaties.

It also turns out my Gmail account won't run out of space in the foreseeable future.

I'll be keeping track of how I approach problems, and how I attempt to solve them. This is basically what I do at work, so to avoid intellectual property concerns, I won't mention workplace projects.

Will my Gmail run out of space?

I have a strong relationship with Gmail.

I use it for all my personal emails. I used it for all my CMU email, and I would use it for work if they let me. I keep names, records, and use it as extended memory. I've got no idea when I last emailed my mom, what my girlfriend's email address actually is, my boss's email, or when I last saw the doctor.

It's all there, archived where I can find it.

Suffice it to say, I don't want to run out of space.

To find out if I will, I gathered some data. The information I have is:

1) I started this Gmail account on March 14, 2005.
2) I've used a total of 1.182 gigabytes.
3) The current space allowed is 7.292 gigabytes.
4) Gmail was started on April 1, 2004 with exactly 1 gigabyte.
5) Today is February 5, 2009.

From here, it's easy enough to give some derived values. The average amount of space I use is about 830 kilobytes per day. The space allocated increases by about 3.5 megabytes a day.
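Those derived values can be checked with a few lines of date arithmetic. This sketch uses only the figures listed above; the one assumption is treating a gigabyte as 1000 megabytes, which is what makes the 830 kilobyte figure come out.

```python
from datetime import date

today = date(2009, 2, 5)
account_opened = date(2005, 3, 14)
gmail_launched = date(2004, 4, 1)

used_gb = 1.182
allowed_gb = 7.292
launch_allowance_gb = 1.0

MB_PER_GB = 1000  # assumption: decimal units

days_using = (today - account_opened).days    # 1424 days
days_growing = (today - gmail_launched).days  # 1771 days

# Simple averages: total change divided by elapsed days.
use_mb_per_day = used_gb * MB_PER_GB / days_using
growth_mb_per_day = (allowed_gb - launch_allowance_gb) * MB_PER_GB / days_growing
ratio = growth_mb_per_day / use_mb_per_day

print(f"use: {use_mb_per_day:.2f} MB/day, growth: {growth_mb_per_day:.2f} MB/day")
```

The ratio of the two rates comes out a little over four.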

Both of these numbers are huge, and Adventures With Large Numbers will be another post.

It's important to remember that these are just simple averages. If I actually kept a record, I'd see irregularities and discontinuities. Memory tells me they doubled the space on April 1, 2005. But that's not the data I've got, and I know better than to trust my memory. I don't have the daily sizes, so I cannot easily use that for analysis. Nor do I need it for a ball-park estimate.

So long as the amount raised is greater than my daily use value, I shouldn't run out of space. As it turns out, the amount they raise my storage allowance by is about four times the amount I use, so I think I'm in the clear.

If I was using closer to half the average daily increase, I'd want to find the daily values and make better predictions about April First. I might even set up both as exponential relationships.

If my Daily Use Value was higher than the Average Daily Add, I'd make a prediction on when I'd run out of space. Given the current values of each, it would be easy enough to make that prediction.

Even if I were using half a megabyte a day more than I am getting, my Gmail wouldn't fill until 2042. This would require using roughly five times the space I currently am. And, in 2042, when it was finally full, I'd have about 50 gigabytes of space.
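Here is a rough sketch of that what-if, using the rounded 3.5 megabytes a day from earlier and, again, 1000 megabytes to the gigabyte (an assumption). The function name is made up for illustration.

```python
from datetime import date, timedelta

def days_until_full(free_mb, use_mb_per_day, growth_mb_per_day):
    """Days until free space is gone, or None if the allowance grows
    at least as fast as it is used."""
    deficit = use_mb_per_day - growth_mb_per_day
    if deficit <= 0:
        return None
    return free_mb / deficit

today = date(2009, 2, 5)
allowed_gb, used_gb = 7.292, 1.182
growth_mb_per_day = 3.5                       # rounded average from above
what_if_use = growth_mb_per_day + 0.5         # half a megabyte a day over the growth

free_mb = (allowed_gb - used_gb) * 1000       # assumption: 1 GB = 1000 MB
days_left = days_until_full(free_mb, what_if_use, growth_mb_per_day)
full_on = today + timedelta(days=round(days_left))
allowance_then_gb = allowed_gb + growth_mb_per_day * days_left / 1000

print(full_on.year, round(allowance_then_gb))
```

At my actual usage the function returns None, which is the whole point: the allowance is pulling away from me, not the other way around.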

All in all, if the future is sufficiently similar to the past, I won't run out of space.
