2002 Ford F150 – Maintenance & Nostalgia

“Big Red”

My earliest memory of working on a vehicle is from sometime in the mid-1980s. My father had a Chevy K5 Blazer, red & white, and it needed new front shocks. I remember squatting on the garage floor, watching him jack up the vehicle and get down to business. It was amazing to see all the nuts and bolts that held a vehicle together.

As I got older, the bug to work on cars was not as strong. In grade school, I was tagged as having a knack for mathematics and computers. I still liked to take things apart, though. Old toasters, hairdryers, broken remote control toys, and the one item that every kid had access to, a 1940s- or 1950s-era nitrous oxide anesthesia machine in my late grandfather’s dental office. The nitrous was long gone when my grandmother let me take the contraption apart.

In the early 2000s, we had a 1997 Dodge Ram 1500 pickup. It had been my father’s vehicle. We did a bit of maintenance on this vehicle – brakes and coolant. Shortly after this truck, we had a 1994 Mazda B4000. Similarly, we did a bit of maintenance on this vehicle, like a new alternator and serpentine belt, an A/C reconditioning, as well as adding a trailer hitch mount.

In my youth, in the 1980s, there was a large amount of carryover from the brown malaise time of the mid to late 1970s. Look at any episode of the 1970s show Kojak, and you will see exactly what I mean. Also, as a kid in the 1980s, you did not think that seeing a Dodge Omni or Chevrolet Chevette was something interesting. In 2019, it’s a bit different. Along my way from being young through being college aged, I had been a periodic car enthusiast. There was the occasional 1960s Corvette, or, on a nice summer day, a Chrysler E body from the early 1970s. Was it a Dodge Challenger or a Plymouth Barracuda? My father would have known, if he had been with me in the vehicle.

Today, I find myself being a bit more of an enthusiast. Seeing a Dodge Omni, or Chrysler K body car, in good shape, will result in me slowing down and maybe turning around to get a closer look. That sixth generation Monte Carlo SS next to me at a gas station will make me linger a bit after the pump clicks off. The owner of the car isn’t at the pump, and you haven’t seen him, but you know that he’s a guy with longer gray hair, a mustache, and maybe cowboy boots. The longer gray haired, mustached man returns from paying cash for his gasoline, and you nod at him. The nod is letting him know, implicitly, that you know his car has an LS4 engine in it. It’s a fast, throaty sounding car, but he doesn’t tear ass out of the gas station, he just loudly rumbles away.

There is this appreciation for cars and trucks that I do not remember having when I was a teen or even in my early twenties. Maybe there’s a tinge of nostalgia now that I am nearly 40. The 1995 Dodge Stratus that my sister and I drove in high school could sport collector’s license plates here in Minnesota, if said vehicle was still in existence, that is. After my sister traded it in on a 2001 Toyota Camry in 2002, I’m sure that dark green Stratus beat its way around Minnesota’s Iron Range for a while, from single mom to single mom, landing with its final owner, a guy who dated one of the single moms and who probably daily drove it to his night shift at one of the taconite mines. With 170,000 miles on the engine, it blew a head gasket. The car was parked on rural land, where it hasn’t run in 15 years, and now has a tree growing through it. But, I’m only guessing as to the fate of that vehicle.

Like those Chrysler JA platform sedans, the tenth generation F-series pickups are kind of unremarkable. The tenth generation of F-series pickups was a modernist take on the ninth generation. Styling-wise, the tenth generation wanted nothing to do with the squarish bodied generations that preceded it. Think of the many theories of epistemology which argue that external and absolute reality could impress itself, as it were, on an individual, as, for example, John Locke’s empiricism, which saw the mind beginning as a tabula rasa, or a blank slate. Likewise, Ford really wanted a clean slate. New body styles, new features and options, even a new lineup of engines, and a new target buyer. Unlike previous generations, the tenth generation F-series, and specifically the F-150, was designed and built for personal use. Prior generations emphasized work. The tenth generation was meant for people like my father. That 1997 Dodge Ram I mentioned earlier was my father’s vehicle before it came into my possession. He daily drove that vehicle 24 miles, one way, simply to park it at an office building, where it would sit all day because he had a white collar job. The Rams, and F-150s, and Silverados of the late 1990s can be thought of as the start of the slippery slope of features that produced, in today’s market, vehicles that amount to luxury pickups. Pickups that cost $60,000 or more. Mind you, there is nothing wrong with wanting a pickup, and there is nothing wrong with wanting to own a pickup just to drive it, just to get you to work, work that is not at a construction site. My father liked having a seemingly big truck. Thirty-four inch tires that rubbed in the wheel wells. Aesthetically, it was a throwback to the 1970s: brown malaise.

In 1990, Ford developed a 4.6L V8 that shared certain parts amongst a family of related engines, collectively known as modular engines. Starting with the tenth generation of F-150s, in 1997, Ford made a 5.4L two valve, single overhead cam (SOHC) engine available in the modular family.

Intake Manifold & More Removed from two valve Triton engine

Our particular F-150 has this engine. It was noted in WardsAuto that Peter Dowding, Ford’s V8 Modular Engine Manager, took the modular V8 engine’s design from good in 1997 to superb for 2002. Starting with the eleventh generation, Ford introduced a three valve modular engine that used a strange two-piece spark plug. That turned into a huge mess. Even today, when someone who thinks they are in the know with light trucks asks, “Does that F-150 you got have one of those piece of shit Triton engines in it?”, they’re thinking of the three valve version of the Triton, not the two valve.

People have opinions on a lot of things, including trucks. Between those opinions and the ubiquitous access to information in this modern age we find ourselves in, this is the golden age for introverted shade tree mechanics, such as myself.

Late last fall, I noticed our F-150 was idling a little rough, and the idle RPMs seemed to be a bit low. By late January (it’s late April as I write this), we noticed what looked like coolant dripping from the exhaust. This, by the way, is not a location where you should normally find engine coolant.

EGR valve

Earlier in January, while my friend Andy was in Minnesota, we had come to the conclusion that the rough, low RPM idle was likely an exhaust gas recirculation valve that was not functioning correctly. (For long time readers, this is the Andy that has appeared here, here, here, here, and probably in a few other posts from over the years). Was the coolant coming from the EGR valve? Turns out, not on the F-150s. It is possible on the diesel variants of the F-250 and above, but not the F-150.

You might be asking what an exhaust gas recirculation valve is, what it does, and why it would cause an engine to have a rough idle. EGR valves first made an appearance in the early 1970s as an attempt to reduce nitrogen oxide (NOx) emissions. This is done by recirculating (hence the name) a portion of the exhaust back into the air intake, which lowers combustion temperatures. With an EGR valve that is malfunctioning and potentially stuck wide open, the incorrect mixture of exhaust gases, air and fuel is making its way into the combustion chamber. This still was not answering the question about why there was coolant at the back end of the truck.

Some might be thinking, it’s gotta be a head gasket. Checking the coolant overflow reservoir showed just coolant, and not a chocolate milkshake of coolant and oil that would have been a dead giveaway for a head gasket problem.

I turned to internet forums; it is the golden age for information of all kinds, after all. Specifically, internet forums for owners and keepers of F-150s. As it turned out, there were a number of individuals with tenth generation F-150s, with mileage in the low six digits, that had a similar issue of finding coolant in the exhaust.

Damaged coolant crossover gasket

The culprit was a gasket between the upper intake manifold and the aluminum cylinder heads. In addition to providing air to all eight cylinders, the intake manifold contains a coolant crossover, allowing coolant to be circulated between the two halves of the engine’s “V” shape.

The failed gasket (pictured above) was allowing a small amount of coolant to seep into the combustion chamber, and then, on that very cold day in late January, condense, again at the tailpipe, and drip.

Somewhere along the way to sleuthing out the issue, I got it stuck in my head that I was going to fix this myself. This is mostly a nod to my late father-in-law, the previous owner of the F-150. It was also a nod to his love of gasoline combustion engines. Nearly two years ago, he and I rebuilt and tuned the single cylinder engine on a 1985 Yamaha 225 ATV. I feel like keeping the F-150 maintained keeps a tacit part of my father-in-law with us. He loved internal combustion. Particularly those small, one and two cylinder engines found on lawn mowers and yard tractors.

The F-150’s 5.4L, single overhead cam, two valve V8 is not a small engine. Initially, I thought of this engine as being just a jacked up version of a V-twin found on both our lawn tractors, or even a times-eight of the engine on a chainsaw. There’s more to it than just being a times-four or times-eight of those engines. For starters, the V8 is fuel injected, while the V-twins and chainsaw engines are carbureted. The V8 has vacuum lines; the small engines do not. The V8 is water/coolant cooled; the small engines are air cooled. On a chainsaw, you generally do not have a battery; instead you have a magneto that is used to generate spark. On the V8 Triton, there are ignition coils that transform the relatively low voltage of the battery into the thousands of volts needed to ignite the air and fuel mixture in the combustion chamber.

Underside of old intake manifold

Before tackling the project, I read the F-150 forums on what it would take to replace the intake manifold. One person said that he was: 1) not a certified mechanic; 2) it took him about six hours to do the work; 3) that six hours included a break for lunch. Other individuals said to take your time and set aside an entire weekend.

I figured it would take me a solid weekend to get this project completed. In total, the project took maybe 8 hours of solid work, but there were multiple days of doing nothing on the vehicle because of waiting for parts or tools to arrive. There was also a snow storm in between starting and finishing the project.

Throttle body, removed

Following removal of the throttle body, and EGR valve, and what seemed like a million feet of vacuum hoses, I had to drain the coolant from the engine block. Why? Remember the previously mentioned coolant crossover? Yes, that; there is coolant in this and if you just removed the hoses from the crossover and then took the intake manifold off…you would end up with a load of coolant all over the place, including into cylinders.

Dirty, old spark plug

Before you can remove the intake manifold, you need to pull all the fuel injectors out of the intake manifold, as well as pulling out all eight ignition coils. It was around this point that the scope of the project expanded. I decided to replace all the rusting injectors, as well as all the ignition coils. At the last minute, I also decided to replace all the spark plugs. The alternator had to come off to get the old manifold out – thanks to Ford putting a massive set of baffles on the underside, you cannot just tip and slide that bad boy without the alternator getting in the way.

Fuel Injectors

With the surface of the heads cleaned and prepped, I started to reassemble the engine. New spark plugs – tightened. I know that the three valve Tritons with their asinine two piece spark plugs are something to behold, but how you remove and replace spark plugs on the two valve Triton is still different from what I’m used to on small engines. On a V-twin motor on a lawn mower, or a chainsaw, the plug or plugs are sort of just out there – it is easy to pop the wire off, and use a spark plug socket to do your business. Even from what I recall of replacing spark plugs on our Mazda B4000, the plugs were out in the open. Not with this Triton V8; the plugs are effectively buried in the engine.

With the new plugs in, the intake manifold could go on. With the intake manifold on, the ignition coils could go in. With the ignition coils in, the new fuel injectors could be installed and the fuel rails reconnected. I just worked backwards from how I had taken things apart. Alternator, throttle body, EGR valve, reconnect the millions of vacuum hoses, reattach the coolant hoses with new clamps, reinstall the drain plug in the engine block, reconnect the battery, and fill the coolant system with new fluid.

The moment of truth for whether I had put it back together correctly came roughly 10 days after I had started – remember, snow storm, ordering parts & tools, and having the general scope of the project expand.

It worked. First stop, gave it a little gas, and that was it. No misfires, or problems. Could I have had someone else do all this work in less time, without having to buy tools? Yes, definitely. The crux of the project was never to save money – I may have, if you exclude time-cost – the point of the project was a tacit nod at my late father-in-law, as well as being able to say I did something with this vehicle.

2005 GMC Savana 1500 Mirror Change

It was a short but extreme feeling winter here in St. Paul. We really did not get snow until sometime in December, or was it January, and this was followed by a week or so of extreme (for the Twin Cities) cold. Then, throughout February, we received snow; a lot of snow. The vernal equinox was nearly a month ago, and we have been having progressively warmer days and longer amounts of daylight as well. The Mississippi River in Lower Town St. Paul went over its banks last week and has since started to go down.

Oftentimes in Minnesota and the rest of the Upper Midwest, we get these last hurrah snowstorms. Earlier this week, we received one such snowstorm.

Up until the snowstorm earlier this week, I had been working on our 2002 F150. It’s parked at the top of our driveway, in front of the garage. At the moment, it is immovable. The intake manifold has a coolant crossover, and there was a gasket that sits between the aluminum intake surface and the plastic/polymer manifold. The gasket was not sealing correctly, and a very small amount of coolant was making its way into the combustion chamber and subsequently out the exhaust. The intake manifold, vacuum hoses, alternator, injectors, ignition coils and a host of other parts are detached at the moment. I will possibly write about this adventure at a later date. The main point is that the pickup is nearly in the way for getting the GMC Savana Hound Hauler parked on the parking pad in front of our fence.

Needing to move the Savana, and in an attempt to not hit the F150, I backed up and turned away from the pickup. On the passenger side of the van, is a row of trees and brush. Not bothering to clear off the recent snow from the van, I figured, I’ve got this. I got too close to the trees and tore the passenger side rearview mirror off.

The last time we looked into replacement mirrors for this ride, we were shocked to see mirrors with power adjustment were averaging a couple hundred dollars. Not this time. When I looked on Amazon, I was surprised to find one for about $45.

Zip ties to secure the broken mirror in the meantime. Plastic shroud removed, exposing retention nuts.

The mirror unit is relatively straightforward to replace. I used a 10mm socket with an extension to remove the three retention bolts located behind a plastic shroud. There is a fairly short, but very helpful video on YouTube that explains what needs to be removed from the door to get at the wiring harness. The Houndmobile is a conversion van; as such, there is a bit more junk in the way to get four or so pieces of plastic removed, but the process was basically the same.

Likely incomplete list of tools/materials:

  • 10 mm socket
  • Socket extension (2″ to 4″ should do)
  • Socket wrench
  • Phillips-head screwdriver (#2 for the majority of the screws)
  • Small Phillips-head screwdriver (#0 for the conversion van wood piece; likely unneeded if your van is stock)
  • Flat-head screwdriver for pushing plastic clips in to release
  • Steel or aluminum wire (think electric fence thickness wire)
  • Needle nose pliers for pulling steel wire
  • Bandaids (I poked a finger with the steel wire)

Basically, remove plastic bits around the door handle until wiring harnesses are exposed. The white connector in the photo is the fellow you will need to disconnect. The mirror and associated wiring should just pull away from the vehicle.

To get the new wiring from the mirror to the connection point, I used a very stiff piece of steel wire, fishing this up to the top left where the mirror mounts. With a bit of jiggling, I was able to get the new mirror’s wiring pulled through and connected.

This was a fairly easy, maybe 45 minute project. I probably spent more time hunting for a tiny Phillips-head screwdriver for a piece of wood panelling that needed to be removed.

Pattern Recognition as Applied to Peer-to-Peer Lending

The purpose of this article is multi-faceted:  1) to give an overview of what pattern recognition and machine learning are; and 2) the uses of machine learning in the context of and as applied to peer-to-peer lending.  This article starts out broad and meanders a bit until it gets into the topic referenced in the title.

Pattern recognition, as the name suggests, is the recognition of patterns in data. That is, given a set of inputs (often called features), can an output (sometimes called a label) be inferred? If that seems a bit foreign, well, it should not be. Humans use our own pattern recognition all the time. When you see a physical book, how do you know it’s a book? Maybe you recognize features of the object: it is rectangular, appears to be made of many sheets of paper, and there are words on at least two of the larger flat surfaces. Maybe it is a book!

It is likely that someone told you, at one point in your life, that the object in front of you, or maybe the one you were holding, is a book. This may have happened a few times, and then, all of a sudden, you were able to recognize books. It is very likely that this happened at such a young age that you do not even remember this first recognition happening. Before going any further, let’s take a detour to explain the other side of this post: peer-to-peer lending. If you are a bit lost, and want to know a bit more about lending, read the post on the Geography of Social Lending first, then come back to this piece.

Social Lending. When a potential borrower uses one of these systems and says, I would like to obtain a loan, the person enters some identifying information, and information from a credit bureau or two is pulled into the respective platform. There is a lot of information that is used to determine 1) whether or not a person has a solid enough credit history to take out a loan; and 2) assuming they qualify for a loan, what interest rate they should receive on the loan. All of the peer-to-peer lending platforms have their own proprietary methods for arriving upon the interest rate and usually a proprietary external or resulting risk metric. In the case of Prosper Lending, the source of the data that this article will be using, this risk metric is a letter score of AA to E in addition to HR (which I usually read as high risk).

Many of the features that go into determining this risk metric are actually available.  There are roughly 400 different features that are available via Prosper Lending’s platform.  Everything from the percent of available credit on an individual’s credit cards,  how long has the individual had credit cards, whether the individual has a mortgage, does the individual have any wage garnishments, and the list goes on.  Similar to the features that allow a person to identify that a book is a book and not a baseball cap, using the features of an individual’s credit profile, it is possible to get a sense of whether the person will fail to repay the loan.  With the availability of loan data from peer-to-peer platforms, there have been a number of competitions to see who can get the most accurate predictions on whether loans will go bad.

If you have been able to follow along so far, great!  I fear, however, that I may lose some of you with the following nerd-stuff.

Mathematicians, computer scientists, statisticians, cognitive experts and many in between have developed a fascinating array of algorithms that fall into the category of machine learning. They have fancy names like Support Vector Machines, Ridge and Lasso Regression, Decision Trees, Extra Trees, Random Forests, Gradient Boosting, and many others. There are a couple of broad groups of problem spaces in machine learning: classification and regression. Classification is pretty much in line with the non-machine-learning definition of the word: which bucket does this thing belong in? With regression, the objective is generally to identify a quantity on a continuous spectrum. An example could be, given a person’s gender, height, weight, and maybe a geographic categorization, how much alcohol is this person likely to consume in a given year?

Let’s take a look at regression first, before looking into classification.

Length in Inches (Y), Age in Months (X)

The basic idea behind any of these regression algorithms is to effectively fit a line to a series of points. When you have three or more dimensions in your data, you end up with planes and hyperplanes, but let’s keep it in simple 2D space for the moment. In a simple example, if you have a plot of points on a graph, let’s say that the Y axis represents children’s height (length), and the X axis represents age in months, and you have measurements until the children were age 3 years. For the sake of example data, let’s include the children’s gender (as assigned at birth).

Looking at the graph to the right, there are a few things you might be able to say. Boys, on average, will be slightly longer than girls of the same age.

What if you had some other data, and it was just age of boys, but you did not have length information on this same set of children?

If you have had a bit of algebra, you might think that there is maybe an equation of a line somewhere in these data.  You might think back to y-intercepts and slopes of lines.  If you thought of this, you just stumbled upon what amounts to 2-dimensional linear regression.  As an aside, knowing that the data are of boys is simply filtering out the girls from the data, and that variable, categorical in this case, is not necessary for the following.  Remember, that the equation of a line, in slope-intercept form, is y = mx + b.
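For the curious, here is a minimal sketch of where the m and the b come from, using the textbook least squares formulas and no libraries. The five age/length points are invented for illustration (they are not the CDC data), and the function name fit_line is my own.

```python
# Closed-form least squares for a line y = m*x + b.
# The points below are made up for illustration; they are NOT the CDC data.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (how x and y vary together) / (how x varies on its own)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

ages = [0, 1, 2, 3, 4]                    # months
lengths = [20.0, 21.5, 23.0, 24.5, 26.0]  # inches, exactly 1.5 per month
m, b = fit_line(ages, lengths)
print(m, b)
```

Because the made-up points sit exactly on a line, the fit recovers that line's slope and intercept. Real data will not be that tidy, which is the whole point of fitting.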

Using statistical software, like Python’s scikit-learn, you can use our existing dataset of ages and lengths to get a linear equation that approximates these data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

def gender(x):
    # the spreadsheet codes sex as 1 or 2
    if x == 1:
        return "Boy"
    elif x == 2:
        return "Girl"

df = pd.read_excel("https://www.cdc.gov/growthcharts/data/zscore/zlenageinf.xls")
# z-score columns run from -2 to 2; the z = 0 column is labeled 'mean' here
df.columns = ['sex', 'age', 'z_-2', 'z_-1.5', 'z_-1', 'z_-0.5', 'mean', 'z_0.5', 'z_1', 'z_1.5', 'z_2']
# .copy() so that adding columns below does not trigger a SettingWithCopyWarning
boys_df = df[df['sex'] == 1].copy()
girls_df = df[df['sex'] == 2].copy()
# this isn't necessary beyond helping me keep things straight in my head
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
boys_girls_df = pd.concat([boys_df, girls_df])

boys_girls_df['gender'] = boys_girls_df['sex'].map(gender)
# demetricify these data (centimeters to inches)
boys_df['length'] = boys_df['mean'] * 0.393701
girls_df['length'] = girls_df['mean'] * 0.393701
boys_girls_df['length'] = boys_girls_df['mean'] * 0.393701

x = boys_df['age'].values.ravel()
y = boys_df['length'].values.ravel()

# scikit-learn expects a 2D array of features, hence the reshape
x_training, x_testing, y_training, y_testing = train_test_split(x.reshape(-1,1), y)

clf = LinearRegression()
clf.fit(x_training, y_training)
predicts = clf.predict(x_testing)
plt.scatter(x_training, y_training, color='lightblue')
plt.plot(x_testing, predicts, color='red')
plt.show()

# slope (m) and intercept (b) of the fitted line
print(clf.coef_[0], clf.intercept_)


Age in months (X) plotted against Length in inches, with Linear Regression Plot

The above bit of Python code downloads an Excel file from the Centers for Disease Control and Prevention, and puts the columns and rows from that spreadsheet into a table format that is more easily programmatically manipulated. After picking the columns necessary for our X and Y, subsetting into boys and girls, and putting some labeling on columns, we end up with things that can be pushed into the linear regression machinery. As an aside, you might have noticed the train_test_split function. Basically, the idea behind this function is to subset your available data into something you show an algorithm, and something you withhold from the algorithm until later to see how well the algorithm can predict unseen values. This training set and testing set is important, and will be brought up, again, further into this post.
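To make the train/test aside a bit more concrete, here is a small sketch of train_test_split on its own. The 40 rows are made up, and I pass test_size and random_state explicitly only so the result is repeatable; the regression snippet above simply relies on the defaults (a quarter of the rows withheld, shuffled differently on every run).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 40 made-up rows with one feature each, plus a made-up target
x = np.arange(40).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

# withhold 25% of the rows; random_state only makes the shuffle repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

print(x_train.shape, x_test.shape)
# no row lands in both sets
print(set(x_train.ravel()) & set(x_test.ravel()))
```

The algorithm only ever sees x_train and y_train while fitting. The withheld x_test and y_test are what you use afterwards to check how well it handles values it has never seen.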

The linear regression code above will also print two values: one is the m in the previously mentioned equation, y = mx + b, and the other is the b in that same equation.

So, our red line has an equation of roughly:

y = 0.438x + 23.399

And if you are thinking that the red line is really not a very great fit for our light blue dots, you would be correct. If you had new data from only 5 month to 30 month old boys, you could approximate the lengths, but if you had data for children outside of these bounds, you would end up with approximations that were too long in the case of those very young, and approximations that were too short for those older.

The question now arises, how does one get a more closely fitting line to the underlying data? The answer ends up being, use fancier machines. For the sake of simplicity, we will skip over whole groupings of other algorithms and look at one called Gradient Boosting. If you have had more advanced calculus, and the word gradient rings a bell, good. That is the same gradient. Perhaps in a future posting, I’ll jump into explaining gradient boosting, but not now.
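Using the fitted line is just arithmetic. A tiny sketch, plugging a few ages into the rounded equation from above (the function name approx_length_inches is my own):

```python
# Slope and intercept as rounded in the equation above
M = 0.438
B = 23.399

def approx_length_inches(age_months):
    """Approximate a boy's length from his age using y = mx + b."""
    return M * age_months + B

for age in (6, 12, 24):
    print(age, round(approx_length_inches(age), 2))
```

At age zero the line simply returns the intercept, a bit over 23 inches, which gives a feel for how the straight line behaves at the very young end of the range.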

Age in months versus Length in inches (Gradient Boost Algorithm)

If you swap out the LinearRegression in the above code snippet with GradientBoostingRegressor (imported from sklearn.ensemble), rerun the code with the print statement commented out, and replace plt.plot(x_testing, predicts, color='red') with plt.scatter(x_testing, predicts, color='red'), you end up with the graph to the right. As you can see, the red dots, which consist of ages from withheld-from-training age/length pairs, with predicted length, plot neatly within the blue dots of actual data points.
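If you would rather not download the CDC spreadsheet, the same comparison can be sketched on synthetic data. The curve below is an invented, growth-like shape (fast at first, then flattening), not the real measurements, and mean squared error on the withheld points is my choice of yardstick rather than something used elsewhere in this post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, growth-like curve: NOT the CDC data
age = np.linspace(0, 36, 200).reshape(-1, 1)
length = 19.5 + 3.0 * np.sqrt(age.ravel())

x_tr, x_te, y_tr, y_te = train_test_split(age, length, random_state=0)

linear = LinearRegression().fit(x_tr, y_tr)
boosted = GradientBoostingRegressor(random_state=0).fit(x_tr, y_tr)

# compare both models on points neither of them saw during fitting
linear_mse = mean_squared_error(y_te, linear.predict(x_te))
boosted_mse = mean_squared_error(y_te, boosted.predict(x_te))
print(linear_mse, boosted_mse)
```

On a curved relationship like this, the boosted model should come out with the noticeably smaller error, for the same reason the red dots sit inside the blue ones: it is not forced to be a straight line.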

That’s a brief introduction into applying regression to a problem set.  Now, what about classifiers?

I tend to think of classifiers as still having dots of data on a graph, and continuing to find a line that fits these data. However, instead of trying to get dots directly on the line, you try to figure out which side of the line the dot belongs to, and then you can say it belongs to category 1, or you can say it belongs to category 2. In the case of our example infant length and gender data, you could use age and length as input features to try to categorize the birth gender of a particular child. On the peer-to-peer lending front, this equates to using those 400+ features that are included with a loan’s listing to try to get the outcome of success or fail. There is also a probability metric associated with the classification that could be thought of as, how close to the dividing line is this prediction?

There are a handful of academic works that look at applying machine learning to the peer-to-peer lending space. The one that seemed to come up the most, as I researched the topic myself for projects during my graduate studies, was a journal article from 2015: Risk Assessment in Social Lending via Random Forests, by Malekipirbazari & Aksakalli. The abstract for that article is as follows:
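Here is a rough sketch of the which-side-of-the-line idea. I am using logistic regression because it literally draws a dividing line; it is a stand-in for illustration, not the algorithm used on the lending data. The two clusters of points are randomly generated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# two made-up clusters of points, one per category
category_1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
category_2 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))
X = np.vstack([category_1, category_2])
y = np.array([1] * 100 + [2] * 100)

clf = LogisticRegression().fit(X, y)

far_point = [[5.0, 5.0]]   # deep inside category 2 territory
near_line = [[2.0, 2.0]]   # roughly halfway between the clusters

# predict gives the category; predict_proba gives one probability per category
print(clf.predict(far_point), clf.predict_proba(far_point))
print(clf.predict(near_line), clf.predict_proba(near_line))
```

The point far from the dividing line gets a probability close to 1 for its category, while the point sitting near the line gets something much less decisive. That is the probability metric described above.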

With the advance of electronic commerce and social platforms, social lending (also known as peer-to-peer lending) has emerged as a viable platform where lenders and borrowers can do business without the help of institutional intermediaries such as banks. Social lending has gained significant momentum recently, with some platforms reaching multi-billion dollar loan circulation in a short amount of time. On the other hand, sustainability and possible widespread adoption of such platforms depend heavily on reliable risk attribution to individual borrowers. For this purpose, we propose a random forest (RF) based classification method for predicting borrower status. Our results on data from the popular social lending platform Lending Club (LC) indicate the RF-based method outperforms the FICO credit scores as well as LC grades in identification of good borrowers [1].

This paper and a few others I have come across use Random Forest, and it seems to be the go-to algorithm. Much of the data from peer-to-peer lending platforms are categorical, but some is continuous. Random Forests handle these heterogeneous data quite well [2].

The general classification idea, when applying machine learning to peer-to-peer lending data, is a binary problem:  Do loans succeed in being repaid or do they fail to be repaid?

Risk Assessment in Social Lending via Random Forests is really a great paper. Even though it focuses on the more readily available data from Lending Club, it does a great job of providing a brief literature review of prior papers that looked into peer-to-peer lending, as well as looking at three different machine learning approaches to the problem. They also highlight that, within Lending Club’s data, simply relying upon the proprietary grade metric is not necessarily indicative of a good borrower. The concept of good borrower and their identification is, however, the one point that perplexes me. I tend to couch the problem in a different way: I want to successfully identify bad borrowers, so as to avoid lending to them. Thinking of the problem space with this notion in mind also highlights another characteristic of these data: imbalanced classes or categories of final loan state (success or fail). Many classification techniques work best when the classes being evaluated are more or less even in size [3]. Many canonical classification algorithms have the assumption that the outcome classes of a dataset are balanced [4]. After the data from Prosper have been mapped into a binary categorization of fail and success, the counts of each outcome show an imbalance across the whole dataset: 75% of the loans were successful, and 25% of loans failed. This imbalance of classes can lead to classifier bias. In my mind, this means that it is easier and more likely that a new, never before seen loan listing will be classified as success by your trained algorithm than it is to identify loan listings that are best avoided.
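A quick back-of-the-envelope sketch shows why that 75/25 split is a problem. Imagine a lazy classifier that calls every single loan a success. The 100 outcomes below are made up to mirror the split; they are not actual Prosper records.

```python
# 100 made-up outcomes mirroring the 75% / 25% split
actual = ["success"] * 75 + ["fail"] * 25
# a lazy "classifier" that says every loan will be repaid
predicted = ["success"] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# of the loans that really failed, how many did we flag?
fails_caught = sum(a == "fail" and p == "fail" for a, p in zip(actual, predicted))
fail_recall = fails_caught / actual.count("fail")

print(accuracy, fail_recall)
```

Three out of four predictions are correct, which sounds respectable, yet not a single bad loan gets flagged. Since avoiding bad borrowers is the goal, accuracy alone is a misleading scoreboard on data like these.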

This imbalance in classes leads us into the general area of preprocessing.  If you think of running data through your machine learning algorithm as processing, this is the step before that.  The algorithms are pretty sophisticated, but if the data have non-numerical values, for example, the Prosper Rating, the algorithm will not know what to do with these values; these values need to be turned into something numerical.  Likewise, with the imbalance of outcome classes, something could be done with this.
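As a sketch of the turn-letters-into-numbers step, here is one way to encode the Prosper Rating as an ordered number. The column name prosper_rating and the choice of 0 for AA through 6 for HR are my own assumptions for illustration; one-hot encoding would be another reasonable option.

```python
import pandas as pd

# Ratings from least risky to most risky
RATING_ORDER = ["AA", "A", "B", "C", "D", "E", "HR"]
rating_to_number = {rating: i for i, rating in enumerate(RATING_ORDER)}

# hypothetical listings; 'prosper_rating' is an assumed column name
listings = pd.DataFrame({"prosper_rating": ["AA", "C", "HR", "B"]})
listings["prosper_rating_num"] = listings["prosper_rating"].map(rating_to_number)

print(listings)
```

An ordered encoding keeps the notion that HR is riskier than E, which is riskier than D, and so on. That ordering would be lost if the letters were assigned arbitrary numbers.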

Let’s talk more specifically about what preprocessing was done on our Prosper Lending dataset.

If you have an investor account with Prosper, you can freely download two different sets of data: listing data and loan data.  Listing data sets contain the borrower’s credit profile, the state they reside in, their occupation (a value that originates from a drop down menu or a set of pick-one radio buttons in the user interface), as well as a whole host of other bits of information.  Loan data sets contain a sort of snapshot in time of the current status of a loan, if the loan is still actively being paid off.  In the case of loans that have “run their course”, this means that they were either successfully repaid, or they entered into a status of “default” or “charge-off”; both of these statuses, for our purposes, are considered fails.

We will only be concerned with loans that have either successfully been repaid, or have defaulted or been charged-off.  We are not concerned with active, in good standing loans.  As an aside, we might be interested in active, in good standing loans if there were a secondary market to sell loans that are in the process of being repaid.  Sadly, Prosper shut down its secondary market a few years ago, and as such, once a loan has originated and the lenders receive notes, those notes are effectively an illiquid asset.

The key piece of information that is missing from these two sets of data is a linkage between the listing from the borrower and a loan’s final outcome.  The field listing_number is noticeably absent from the loan datasets.  A linkage between these two datasets is possible by matching the borrower’s state, the borrower’s interest rate on the loan, the date the listing was turned into a loan, and the amount of money of the loan.  This will get you nearly all the way to having a correctly linked set of datasets.  But Prosper does make this linkage data available; more data are available if you inquire and sign a data sharing agreement.  The linkage data, however, is in a terribly inconvenient format.  It is a 16GB (2GB compressed) comma separated file that represents every payment and every payment attempt on the Prosper platform from nearly the start of the platform up to roughly the previous calendar quarter.  This enormous file is data rich, and would allow for finer grained analysis of Prosper’s lending and borrowers’ repayment efforts, but the bit that we are interested in is the listing_number <-> loan_number pairing that is found in this file.  As a side note, I tend to use listing_number (lower case with an underscore), while Prosper tends to use ListingNumber (CamelCase words); I will go back and forth between the two flavors, but they mean the same thing.  I wrote a bit of wrapper code that can read in all the zip files in a directory for the listing data and the loans data, and make Prosper’s CamelCase column names into snake_case.
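The renaming part of that wrapper can be sketched in a few lines.  This is not the wrapper itself, just one common regular expression approach; the helper name to_snake_case is made up for illustration.

```python
import re

def to_snake_case(name):
    """Convert a CamelCase column name such as ListingNumber to snake_case."""
    # Split before a capitalized word: "ListingNumber" -> "Listing_Number".
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Split between a lowercase letter or digit and a capital: "LoanID" -> "Loan_ID".
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()

for column in ["ListingNumber", "LoanID", "BorrowerState"]:
    print(column, "->", to_snake_case(column))
```

With Pandas, a function like this can be handed to DataFrame.rename(columns=...) to rename every column in one go.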

Take this enormous CSV file, read it into a Pandas DataFrame, select the two columns we are interested in, ListingNumber and LoanID, and then group by those two columns:

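import pandas as pd

# Only the two linkage columns are needed; skipping the rest keeps memory use down.
df = pd.read_csv('./LoanLevelMonthly.zip', usecols=['LoanID', 'ListingNumber'])
# One row per unique (LoanID, ListingNumber) pair, plus how many payment rows each pair had.
loans_listings_df = df.groupby(['LoanID', 'ListingNumber']).size().reset_index()
loans_listings_df.columns = ['loan_number', 'listing_number', 'count']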

So, where does one get access to these data?  You will need to contact Prosper via their help desk and ask for the “Prosper Data License Agreement”.  Once you sign and return this agreement to them, within a day or two you will be sent information on how to download the latest data from the platform.

Then, you will need the historical listings and loans data, which you will tie to loan outcomes using the linkage built above.  Assuming you have a Prosper account and are already logged in, as I mentioned above, you can just download all the historical data.

At this stage, we are only interested in data on loans that are done.  That is, loans that have either been successfully and fully repaid, or loans that have defaulted or have been declared a charge-off.  In these data, that means a loan_status greater than 1 and less than 6.

Reading in the listings zip files, you will want to simply read the first (and oldest) file into a Pandas DataFrame, and then read the next oldest, appending that file’s contents to the previously created DataFrame.  You will do this with all of the listings zip files.  Similarly, read all the loans data into a separate DataFrame, appending the newer loans to the older loans.
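A minimal sketch of that read-and-append loop follows.  It assumes each zip archive holds a single CSV and that the file names sort from oldest to newest; the function name and directory names are placeholders, not Prosper’s.

```python
import glob
import os

import pandas as pd

def read_zipped_csvs(directory, pattern="*.zip"):
    """Read every zipped CSV in `directory` (oldest first, by file name)
    and stack them into a single DataFrame."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    # pandas infers zip compression from the extension, provided each
    # archive holds exactly one CSV file.
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage; the directory names are placeholders:
# listing_df = read_zipped_csvs("./listings")
# loan_df = read_zipped_csvs("./loans")
```

Collecting the pieces in a list and calling pd.concat once tends to be quicker than growing a DataFrame one file at a time, and it also works on recent Pandas releases where DataFrame.append is gone.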

After reading in both the listing data and the loan data, we select only the loans with the statuses we want.

loan_df = loan_df[loan_df['loan_status'] > 1]
loan_df = loan_df[loan_df['loan_status'] < 6]

The counts of the remaining loans, by loan_status, are:

2     25244
3     90481
4    361687
dtype: int64

You should have three DataFrames at this point.  One DataFrame with the linkings of loan_number to listing_number, one DataFrame containing loans data, and one DataFrame containing listing data.

The only reason we link loans to listings is to get each loan’s final status, so we will want to select out just two columns: the loan_number and the loan_status.

loan_status_df = loan_df[['loan_number', 'loan_status']]

At this point, if we wanted to do something like figure out the actual, real rate of return on a loan, that is, the amount of money that the lender ultimately received, we could calculate that value at this stage.  It would look something like:

loan_df['actual_return_rate']  = (loan_df['principal_paid'] + loan_df['interest_paid'] + loan_df['service_fees_paid'] - loan_df['amount_borrowed']) / loan_df['amount_borrowed']/(loan_df['term'] / 12)

Note that service_fees_paid is negative, so we just add it to the amount to effectively subtract it.
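As a quick sanity check of that expression, here is the same arithmetic as a plain Python function applied to two made-up loans.  The dollar amounts are invented for illustration; only the formula mirrors the line above.

```python
def actual_return_rate(principal_paid, interest_paid, service_fees_paid,
                       amount_borrowed, term_months):
    """Annualized simple return, mirroring the expression above.

    service_fees_paid is expected to be negative, so adding it
    subtracts the fees.
    """
    net_gain = (principal_paid + interest_paid + service_fees_paid
                - amount_borrowed)
    return net_gain / amount_borrowed / (term_months / 12)

# Made-up numbers: a fully repaid $1,000, 36-month loan.
repaid = actual_return_rate(1000.0, 150.0, -10.0, 1000.0, 36)
# Made-up numbers: the same loan charged off after $400 of principal.
charged_off = actual_return_rate(400.0, 60.0, -4.0, 1000.0, 36)

print(round(repaid, 4))       # 0.0467
print(round(charged_off, 4))  # -0.1813
```

Note that dividing by the full term treats a loan that was charged off early as if it ran all 36 months, which is a simplification built into the expression.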

If we did want to include actual_return_rate, we would include it in the columns we select out (above), but we would need to remember to remove actual_return_rate later, when we are attempting to classify listings, as it would effectively spike our results by leaking each loan’s outcome into the features.  Likewise, if we were using a regressor to try to predict the actual_return_rate, we would want to remove loan_status from the mix.

But for now, we will only be concerned with looking at predicting a loan’s outcome solely on what was in the listing for that loan.  Back to our three DataFrames.

Start by merging loan_status_df to loans_listings_df.

loans_with_listing_numbers_df = loan_status_df.merge(loans_listings_df,on=['loan_number'])

This will do an inner join on the two DataFrames, and truncate off loans that are too new compared to the loans <–> listings linkage data.

Similarly for the listing data, we merge listing_df with the newly created loans_with_listing_numbers_df

complete_df = loans_with_listing_numbers_df.merge(listing_df,on=['listing_number'])
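To see what the two inner joins do, here is a toy version with three tiny, made-up DataFrames that reuse the names from above.  One loan has no entry in the linkage data and one listing never became a linked loan; both drop out.

```python
import pandas as pd

# Toy stand-ins for the three DataFrames; all values are made up.
loan_status_df = pd.DataFrame({
    "loan_number": [1, 2, 3],
    "loan_status": [4, 2, 4],
})
loans_listings_df = pd.DataFrame({
    "loan_number": [1, 2],          # loan 3 is newer than the linkage file
    "listing_number": [101, 102],
})
listing_df = pd.DataFrame({
    "listing_number": [101, 102, 103],
    "borrower_state": ["MN", "IL", "CA"],
})

# merge() defaults to how="inner", so rows without a match on both sides drop out.
linked = loan_status_df.merge(loans_listings_df, on=["loan_number"])
complete = linked.merge(listing_df, on=["listing_number"])

print(len(loan_status_df), len(complete))  # 3 2
```

Comparing row counts before and after each merge, as in the last line, is a cheap way to notice when a join silently drops more than expected.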

At this point, you will have a DataFrame that contains hundreds of thousands of rows of heterogeneous data. Some dates, some categorical (e.g. prosper scores, credit score range bins, and so forth), as well as continuous values (usually something dealing with a percent or a dollar amount).

There are also columns that you will want to exclude or remove completely, because they are present in these historic data but not present in the listing data that are obtained via Prosper’s API for active listings.  We will filter, remove, or otherwise transform the DataFrame into something that is slightly more usable.

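import time

# Assumption: parser.parse() below is dateutil's date parser.
from dateutil import parser


def remap_bool(x):
    # Map boolean-like values (real booleans or their string forms) to 1 / -1.
    # Anything unrecognized falls through and returns None.
    if x == False:
        return -1
    elif x == 'False':
        return -1
    elif x is None:
        return -1
    elif x == True:
        return 1
    elif x == 'True':
        return 1
    elif x == '0':
        return -1


def to_unixtime(d):
    # Convert a datetime to seconds since the epoch (local time).
    return time.mktime(d.timetuple())


def to_unixtime_str(d):
    # Missing dates fall back to a fixed placeholder date.
    if str(d) == 'nan':
        return to_unixtime(parser.parse('2006-09-01'))

    return to_unixtime(parser.parse(d))


def remap_str_nan(x):
    # Force everything, NaN included, to a string so it is treated as a category.
    return str(x)


def remap_loan_status(x):
    # Status 4 is a success (1); every other remaining status is a fail (-1).
    if x == 4:
        return 1
    return -1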

df = complete_df.copy()
df['loan_status'] = df['loan_status'].map(remap_loan_status)
df['has_mortgage'] = df['is_homeowner'].map(remap_bool)
df['first_recorded_credit_line'] = df['first_recorded_credit_line'].map(to_unixtime_str)
df['scorex'] = df['scorex'].map(remap_str_nan)
df['partial_funding_indicator'] = df['partial_funding_indicator'].map(remap_bool)
df['income_verifiable'] = df['income_verifiable'].map(remap_bool)
df['scorex_change'] = df['scorex_change'].map(remap_str_nan)
df['occupation'] = df['occupation'].map(remap_str_nan)
df['fico_score'] = df['fico_score'].map(remap_str_nan)

df = df.drop(['channel_code', 'group_indicator', 'orig_date', 'borrower_city', 'loan_number', 'loan_origination_date', 'listing_creation_date', 'tu_fico_range', 'tu_fico_date', 'oldest_trade_open_date', 'borrower_metropolitan_area', 'credit_pull_date', 'last_updated_date', 'listing_end_date', 'listing_start_date', 'whole_loan_end_date', 'whole_loan_start_date', 'prior_prosper_loans61dpd', 'member_key', 'group_name', 'listing_status_reason', 'Unnamed: 0', 'actual_return_rate', 'investment_type_description', 'is_homeowner', 'listing_status', 'listing_number', 'listing_uid'], axis=1)

We end up with a slightly cleaner, slightly more meaningful (from an algorithm’s perspective) table of data.  However, there is still more that can be done to these data.  It is up to you to determine whether there are more steps in a pipeline that could make these data more meaningful.  Other steps could include removal of outliers via an Isolation Forest5.  Dealing with the class imbalance is also something to consider.  For the final model that we developed, we used SMOTE6 for oversampling during our training phase.  That is, we used SMOTE to take training data, a 60% to 75% sampling of the whole data, and produce synthetic data with balanced classes.  This leaves you with legitimate, actual historic data to verify (test) your model with, to see how well it predicts what you are interested in predicting.

It should also be mentioned that we one hot encoded our data.  We split off the outcome column, and one hot encoded our features.  This has the effect of taking the remaining categorical columns, such as borrower_state, and making individual boolean columns for each category in that column (in the case of state, this resulted in something like 52 new columns).
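A small sketch of that step, using Pandas’ get_dummies on a made-up three-row table.  Only the column names echo the Prosper data; whether our actual pipeline used get_dummies or another encoder is not important here.

```python
import pandas as pd

# Made-up rows; only the column names echo the Prosper data.
df = pd.DataFrame({
    "loan_status": [1, -1, 1],
    "borrower_state": ["MN", "IL", "MN"],
    "amount_borrowed": [4000, 10000, 2500],
})

# Split off the outcome column, then one hot encode what is left.
y = df["loan_status"]
X = pd.get_dummies(df.drop(columns=["loan_status"]))

print(list(X.columns))
# ['amount_borrowed', 'borrower_state_IL', 'borrower_state_MN']
```

Numeric columns pass through untouched, while each distinct state becomes its own column.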

The model has actually been deployed into a small, real world experiment where, based on the scores produced by the model, actual listings are being automatically invested in.  It has a pipeline of something like the following:

Linking Dataset -> Cleaning & Translating -> One Hot Encoding -> Anomaly Detection & Removal -> Rescaling data between 0 and 1 -> Oversampling -> Training -> Verification -> Deployment 

Getting to the point of having a tuned, deployable model, should actually be the topic of another followup article.  The gist of the tuning involves a lot of brute force, grid searching with parameters.

Our final classifier is something like this:

from xgboost import XGBClassifier

clf_gb = XGBClassifier(n_estimators=639, n_jobs=-1, learning_rate=0.1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                       objective='binary:logistic', scale_pos_weight=1, max_depth=9, min_child_weight=10, silent=False, verbose=50, verbose_eval=True)

We take the intermediate things, like the min max scaler object, in addition to the classifier itself, and pickle these. Before you get your nerd panties in a twist, we do realize that there are dangers to using pickled objects in python. We’ll assume the risks for now.
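Here is a rough sketch of the fit, pickle, and restore cycle, using only the standard library.  The tiny dictionary-based scaler below merely stands in for the real min max scaler object, and the training rows are invented.

```python
import pickle

def fit_min_max(rows):
    """Record the per-column min and max of the training rows."""
    columns = list(zip(*rows))
    return {"min": [min(c) for c in columns], "max": [max(c) for c in columns]}

def apply_min_max(scaler, row):
    """Rescale one row to the 0..1 range seen during training."""
    scaled = []
    for value, lo, hi in zip(row, scaler["min"], scaler["max"]):
        # A constant column has hi == lo; map it to 0.0 to avoid dividing by zero.
        scaled.append((value - lo) / (hi - lo) if hi != lo else 0.0)
    return scaled

# Made-up training rows: (amount_borrowed, borrower_rate).
train = [(2000, 0.08), (10000, 0.20), (6000, 0.14)]
scaler = fit_min_max(train)

# Serialize at training time, restore at scoring time.
blob = pickle.dumps(scaler)
restored = pickle.loads(blob)  # only ever unpickle data you produced yourself

print(apply_min_max(restored, (6000, 0.14)))
```

The point of pickling the fitted scaler alongside the classifier is that new listings get rescaled with the minimums and maximums seen during training, rather than with whatever range the new data happen to have.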

These pickled objects then get deployed with a thin wrapper that takes output from Prosper’s API endpoint for listings, and does translation and blows out things to required columns (remember, we one hot encoded things, so, there are a lot of columns to fill out). This then gets run through the model to produce a scoring of success and failure probability.
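The "blowing out" step can be sketched like this.  The column list and the single listing are made up, and this is only one way to line a new listing up with the columns seen during training, not necessarily how our wrapper does it.

```python
import pandas as pd

# Hypothetical: the feature columns the model was trained on, saved at training time.
training_columns = ["amount_borrowed", "borrower_state_IL", "borrower_state_MN"]

# One made-up active listing, shaped like a record from a listings API.
listing = pd.DataFrame([{"amount_borrowed": 5000, "borrower_state": "WI"}])

encoded = pd.get_dummies(listing)
# Add training columns this listing lacks (filled with 0) and drop any the model never saw.
features = encoded.reindex(columns=training_columns, fill_value=0)

print(features.iloc[0].tolist())
```

A category the model never saw in training, like the state in this example, simply ends up as all zeros across the known state columns.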

If you recall from above, the one thing that we were ultimately concerned with was identifying loans that will most likely fail.  Likewise, in that same statement, there is an implicit desire to minimize the number of loans that failed but were classified as being successful during our testing phase.  Here is the confusion matrix from our testing phase:

array([[ 8022,  6080],
       [24031, 90805]])

That 6,080 is the smallest number in the bunch, and we interpret this as a good sign.  To further address the likelihood of investing in a falsely labeled listing, we also filter on the success probability, for example by excluding listings with a score under 0.85.  This still won’t catch truly out of whack listings, but we hope it helps.
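For a slightly fuller reading of that matrix, the sketch below derives two rates from it and then applies the 0.85 cutoff to a few invented scores.  It assumes rows are actual outcomes and columns are predictions, with the fail class first; that layout is an assumption, but it is consistent with 6,080 being the failed loans that were classified as successes.

```python
# The confusion matrix from the testing phase, copied from above.
# Assumption: rows are actual outcomes and columns are predictions, with the
# fail class first (the layout scikit-learn uses for labels -1 and 1).
matrix = [[8022, 6080],
          [24031, 90805]]

(fail_as_fail, fail_as_success), (success_as_fail, success_as_success) = matrix

# Of the listings the model called a success, how many really were?
success_precision = success_as_success / (success_as_success + fail_as_success)
# Of the loans that really failed, how many did the model flag?
fail_recall = fail_as_fail / (fail_as_fail + fail_as_success)

print(round(success_precision, 3))  # 0.937
print(round(fail_recall, 3))        # 0.569

# Hypothetical scored listings: (listing_number, success probability).
scored = [(101, 0.97), (102, 0.84), (103, 0.85)]
keep = [number for number, p in scored if p >= 0.85]
print(keep)  # [101, 103]
```

Under that reading, roughly 94% of the listings labeled a success in testing really were, at the price of passing over a fair number of loans that would have been repaid.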

Useful Links & Things

Geography of Social Lending

Over the last few years, the idea of applying computer assisted pattern recognition, more commonly known as machine learning, to social lending has sort of stuck with me.  Sometime in 2015, a colleague and I first looked at this problem space.  I may write about the machine learning aspect in a future blog post, but that is not the focus of this piece.  It was not until recently that I began to think of lending in the context of geography.  Could visual patterns be teased out from the available data?  There is an existing article on the topic, but the granularity of the analysis is at the state level.  Similar to that geographical analysis of Prosper, there’s also a look at Lending Club at the ZIP3 level.  I wanted to get to a smaller unit of political geography.  Before I get into this, let’s give some context to what exactly Social Lending is all about.

The basic idea with social lending starts with a person who wants or needs a bit of money.  The person makes a listing for a loan using an online platform like Lending Club or Prosper, instead of going directly to a bank.  Social lending, more commonly known as peer to peer lending, sells the idea that it offers opportunities for both borrowers and lenders to reach their own objectives outside of direct interaction with banks.  Lenders, big and small, have a potential opportunity to put their money to work, while borrowers are able to access money through an alternative to traditional bank loans and credit cards.  As with many transactional things in the era of the Internet, both lenders and borrowers fill out forms via web pages on the respective platforms.  To give background to the size of the peer to peer lending industry, by early 2015, the largest peer to peer lending platform, Lending Club, had facilitated over six billion dollars worth of loans1.  With an active listing on one of the platforms, potential investors in the loan that may result from the listing review the listing’s information and decide whether to commit some amount of money to the final loan.

A bit of the appeal of peer-to-peer lending, along with being an alternative source of money for borrowers who might have difficulties accessing credit through other channels, is how these loans are securitized into notes and presented to investors.  Let’s take a quick detour into securitization.

The basic idea of securitization is to take many financial obligations, e.g. loans or debts, pool them together into an even larger thing, and then chop that larger thing into small pieces.  The small pieces are then sold to investors, who expect an eventual return of their initial investment, with interest payments along the way.  Securitization has been around for a long time.  In the 1850s, there were offerings of farm mortgage bonds by the Racine & Mississippi railroad.  These farm mortgage bonds had three components: 1) the note, which stated the financial obligation of the farmer to repay the stated mortgage amount; 2) the mortgage, which offered the farm as collateral; and 3) the bond of the railroad, which offered its reputation for repayment in addition to other assets2.  In the 1970s, the Department of Housing and Urban Development created the first modern residential mortgage-backed security when the Government National Mortgage Association (Ginnie Mae or GNMA) sold securities backed by bundled mortgage loans3.  There is also a fascinating look back at a moment in securitization history in a Federal Reserve Bank of San Francisco Weekly Note from July 4, 1986.

The peer-to-peer lending industry, with a focus on everyday people who want to invest in these loans (as opposed to large banks, and private equity investors) is slightly different in how loans are securitized.  Instead of bundling many, multi-thousand dollar loans into a pool, and then dividing the pool into notes, a single loan, for example, in the amount of $10,000, is divided into notes in denominations ranging from $25 up to thousands of dollars.  An investor could buy a single $25 note, or she could buy a larger percentage of a given loan.  As an aside, a widely held objective in investing is to maximize return on investment and reduce risk.  A diversification of the risk is supposed to be achieved by buying a slice of many different loans4.

Let’s get back to the topic at hand: the geography of social lending.  First, the data.  I will be using data from Prosper.  There’s a tremendous amount of work behind getting these data into a shape and structure that lends itself both to looking at things geographically, and to simply getting historical data that matches the data for the listing that a borrower made with the data for the loan that followed the listing.  This process involved first having an investment account with Prosper, and then applying for an additional level of access for finer grained data.  Without the finer grained level of access, the problem becomes an issue of record linkage: tying listing data to loan data based on the interest rate of the loan, the date of the loan’s origination, the amount of money the loan was for, the state of the borrower, and a couple of other characteristics.  It is fairly accurate, but if one is able to get true listing to loan matches, just use that.
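For the curious, the record linkage fallback can be sketched as a join on several columns at once.  Everything below is made up, including the column names; it is meant only to show the shape of the idea, not the matching rules I actually used.

```python
import pandas as pd

# Made-up listings and loans; the column names are illustrative assumptions.
listings = pd.DataFrame({
    "listing_number": [101, 102, 103],
    "borrower_state": ["MN", "MN", "IL"],
    "borrower_rate": [0.12, 0.12, 0.18],
    "origination_date": ["2016-03-01", "2016-03-01", "2016-03-02"],
    "amount": [4000, 7500, 4000],
})
loans = pd.DataFrame({
    "loan_number": [9001, 9002],
    "borrower_state": ["MN", "IL"],
    "borrower_rate": [0.12, 0.18],
    "origination_date": ["2016-03-01", "2016-03-02"],
    "amount": [7500, 4000],
    "loan_status": [4, 2],
})

keys = ["borrower_state", "borrower_rate", "origination_date", "amount"]

# Keep only key combinations that are unique on both sides, so an ambiguous
# combination can never pair a loan with the wrong listing.
unique_listings = listings.drop_duplicates(subset=keys, keep=False)
unique_loans = loans.drop_duplicates(subset=keys, keep=False)

linked = unique_loans.merge(unique_listings, on=keys)
print(linked[["loan_number", "listing_number"]].values.tolist())
# [[9001, 102], [9002, 103]]
```

Throwing away ambiguous key combinations costs some rows, which is one reason the true listing to loan pairing is preferable when it is available.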

Location. Location. Location.

Contrary to what was said in the Orchard Platform’s article on geography and Prosper, locations at a finer resolution than state are available.  There are, however, a couple of caveats.  The first is that the text in this field (borrower_city) is freeform and entered by the borrower.  There is no standardization.  You might get a chcgo, a chicgo, or the actual proper noun spelling, Chicago, for the city’s name.  It also appears that entering a city name might be optional, as there are some listings with an empty city.  The other caveat for borrower_city is that it is available only in the historic data downloads, and not via Prosper’s API.  Why is a finer grained location interesting?  Because, if you were an investor, you might want to include a prospective borrower’s city in your judgement on whether or not to invest in a loan.  I won’t trust those Minneapolis borrowers.  In my mind, this is actually the reason this information is suppressed at the time of an active listing.  There are laws and regulations in the US which state that lenders are not allowed to discriminate based on age, sex, and race.  Fair lending laws have been on the books since the 1960s and 1970s5, and so lenders have been keen to avoid perceptions of discrimination based on these characteristics.  Even so, both Prosper and Lending Club, in their early days, had pieces of information shared by the prospective borrowers.  Things like a photo of the borrower along with a message from the borrower were posted in the listing.  Photos could leave an impression of age and race, while the notes often included references to the person’s spouse with associated pronouns6.  Both Prosper and Lending Club have the exact addresses for successful borrowers; there are know your customer rules and regulations, after all.
By not exposing this sliver of information at the time of an active listing, the lending platforms are potentially covering themselves from both actual discriminatory liability, as well as perceived public relations issues (that doesn’t mean that one of these platforms does not periodically have both — likely a paywall on that link, by the way).

At the start of the last paragraph, I mentioned the messiness of these free form city names.  How does one clean up these data into a normalized, relatively accurate location?  Google.  Google, through its cloud services business, offers relatively good name standardization and geocoding services.  So, putting chcgo or chicgo into their system results in Chicago, IL, along with a bunch of other information, like the county it is located in, as well as latitude and longitude information for both a bounding box around the entity and a centroid.

The Google geocoding service, I should add, is not free after a point.  Up to 2,500 uses, there is no charge; for each additional 1,000, it is $0.50.  With a total of 477,546 loans with associated listing data, this seemed potentially expensive.  Instead, I collapsed down the borrower’s city and state into a unique value, and fed that into the geocoding service.  Getting a unique set of city and state combinations significantly reduced the number of things that I would need to geocode: from nearly 478,000 individual loans, down to about 22,000 combinations.  These standardized city/state/coordinates are then reattached to the original data.  Not every user entered city could be identified.  Entries like chstnt+hl+cv, md and fpo were not identified.  FPO and APO (also found in these data) are military installations, Fleet Post Office and Army Post Office, respectively.  The loan/listing entries with locations that could not be identified via Google’s Geocoding Service were removed from these data, resulting in fewer than 10,000 listings, or 1.9% of the total, dropping off.
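The savings, and the collapse-then-reattach trick, can be sketched as follows.  The cost function assumes billing is pro-rated past the free tier, fake_geocode is a made-up stand-in for a real geocoding call, and lower-casing the city before de-duplicating is my assumption for the example.

```python
def geocoding_cost(lookups, free_lookups=2500, price_per_thousand=0.50):
    """Cost in dollars, assuming billing is pro-rated past the free tier."""
    billable = max(0, lookups - free_lookups)
    return billable / 1000 * price_per_thousand

print(round(geocoding_cost(477546), 2))  # 237.52
print(round(geocoding_cost(22000), 2))   # 9.75

def fake_geocode(city, state):
    """Stand-in for a real geocoding call; returns a canonical name or None."""
    known = {("chcgo", "IL"): "Chicago, IL", ("chicago", "IL"): "Chicago, IL"}
    return known.get((city.strip().lower(), state))

# Made-up loans: (loan_number, borrower_city, borrower_state).
loans = [(1, "chcgo", "IL"), (2, "Chicago", "IL"), (3, "CHCGO", "IL"), (4, "fpo", "AE")]

# Geocode each distinct city/state once, then reattach the result to every loan.
unique_places = {(city.strip().lower(), state) for _, city, state in loans}
lookup = {place: fake_geocode(*place) for place in unique_places}
located = [(n, lookup[(c.strip().lower(), s)]) for n, c, s in loans]
# Drop loans whose location could not be identified.
located = [(n, place) for n, place in located if place is not None]

print(len(unique_places), located)
```

Four loans collapse to three lookups here; at the scale of the real data, the same idea is what turns roughly 478,000 lookups into about 22,000.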

I should also give some temporal context to these data; the data range in dates from November 15, 2005 to January 31, 2018.

With a collection of finer grained locations (of unknown quality, I should add), what questions can be visualized with these data?

Orchard Platform’s article on geography of peer to peer lending, as you recall, looks at state level aggregations of data.  The piece looks at choropleth maps of loan originations by volume, loan originations per capita, loans with 30 days or more past due, and finally a map of normalized unemployment rates.

The two maps, above, are originations by place at a city level.  It is effectively showing nothing more than where people live.  It’s a population map.  It is what someone should expect.  You will see more loans originating from the Los Angeles or New York City area than the Fargo, ND/Moorhead, MN area.  There’s just more people (much higher population densities) in the first two metropolitan areas than in the latter; each of those two higher population metropolitan areas are also spatially larger.  The New York metropolitan area, for example, is 13,318 square miles, while the Fargo/Moorhead area is only 2,821 square miles.

Even looking at just failed loans, which one of the above maps does, is still only identifying where populations live.

What if you wanted to look at loan originations, and whether they appear to be concentrated in US counties where a significant proportion of the population is African American?

First you would need data on race, at the county level in the United States.  The US Census Bureau’s American Community Survey is a great source for this type of information.  In addition to data on race, you need this information tied back to counties, census tracts, or states.  There’s a product made by the Institute for Social Research and Data Innovation called the National Historical Geographic Information System, or just NHGIS7.  Along with the census and survey based data, NHGIS has ESRI shapefiles available that tie the data to place spatially.  These are the two things needed.

The above map, with its blue Prosper loan locations, and the red colored choropleth, representing the percent of a county’s population that is African American, on the surface is interesting looking, but it is really only showing where a segment of the greater population live.

I posed the question of race and lending to a colleague of mine, and he thought on it for a short time, and then suggested looking at a choropleth of the number of loans in a county divided by the percent of minorities in a county.

First, define what is meant by minority.  In the case of the following, I simply defined this as not white.  The 2010 US Census found that White – Alone made up 72.4% of the US population8.  Whether or not combining all non-white populations into a single number is the correct thing to do is another story.

In the map to the right, the scatter plot of borrower locations is gone; instead, what is shown is the loan count divided by the ratio of non-whites in a given county.  It is another way to slice the data.  However, it also seems to just be identifying more diverse populations: Los Angeles, Seattle, Chicago, Boston, Las Vegas, and Albuquerque, for example.

Another way to spin the question is to assume, for a moment, that the loans are evenly distributed throughout a county’s population.  If a county was 80% white, 15% African American, and 5% Native American or Alaskan Native, we could assume that 80% of the loans were taken out by white individuals, 15% were taken out by African Americans, and 5% were taken out by Native American or Alaskan Natives.  I highly doubt this is the case.  It would be possible to get a closer idea by looking at county subdivisions and where the geocoded cities are located within those.
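As a tiny worked example of that even-distribution assumption, here is the example county with an invented total of 200 loans.  The function name is made up for illustration.

```python
def allocate_loans(loan_count, population_shares):
    """Split a county's loan count across groups in proportion to population share."""
    return {group: loan_count * share for group, share in population_shares.items()}

# The example county from the text, with a made-up total of 200 loans.
shares = {
    "white": 0.80,
    "african_american": 0.15,
    "native_american_or_alaskan_native": 0.05,
}
print(allocate_loans(200, shares))
```

With 200 loans, that works out to about 160, 30, and 10 loans for the three groups.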

So, taking the idea that things are evenly distributed, you allocate a portion of the loans to non-whites, or one could even look at the individual race groups in the American Community Survey.  This proportioned loan count is then divided by total number of non-whites in the county.  This should have the effect of dampening high loan counts but low non-white populations.

In the map to the left, there are still some larger, more diverse population centers picked up: Los Angeles, San Francisco and the Bay Area counties, Las Vegas, Atlanta, Chicago, and Houston.

In addition to these larger population areas, places like Arapahoe County, Colorado, which is directly east of Denver, show up.  Mahnomen County, in Minnesota’s northwest area, also shows up.  There’s also the curious ring around the Washington D.C. area, too.

One final map.  Let’s take the same map as the previous, but let’s narrow the focus to loans that ultimately were not repaid; that is the number of loans, weighted by the ratio of non-whites in a given county, divided by the total county population.

I could keep slicing and dicing things and coming up with more choropleths, but I won’t.  For a broader look at race and money, ProPublica has a fascinating look at bankruptcy and race in its report, Data Analysis: Bankruptcy and Race in America.  This report states that Memphis, Tennessee, and Shelby County, where Memphis is located, have had the highest bankruptcy rate per capita in the nation.  It is curious to see that Shelby County, Tennessee, DeSoto and Tunica counties in Mississippi, as well as Crittenden and Saint Francis counties in Arkansas all show up in the above map.  These are all counties that are part of the greater Memphis area.

That’s it for now.

Other ideas I have had with regard to Prosper data include looking at whether, given a borrower’s credit profile and state, the county they reside in can be sussed out via pattern recognition (e.g. machine learning).  I will write, at some point, about a simpler application of machine learning: attempting to predict loan failure or success.

Cherry & Walnut Desk

Twelve or thirteen years ago, I had the thought, I need a desk.  Most rational, and retail-centric individuals would have traveled to a furniture store, engaged in conversation with a salesperson, possibly been convinced of the merits of a particular desk, and subsequently completed the sale with the exchange of money for the promise of a desk being delivered at some later date by two, slightly hungover individuals in a large box truck.

I picked up a wood working magazine, instead.  It was around this time, with the use of a friend’s wood shop and a couple hours of his time each Tuesday, that I had finished up a queen-sized, Mission-style oak bed frame.  I was hankering for another project.  A desk seemed reasonable.

I did not follow through on the reasonable idea of taking ready made plans from a woodworking magazine.  Instead, I used them as a guide for things like height and depth.

You might be wondering, why am I bringing up a project that is over a decade past its completion?  There are a couple of reasons.  The first is that I recently disassembled the desk to move it to another room in the house, and the second is that, coincidentally, I came across an archive that contained the bulk of my notes, all of the AutoCAD drawings, and a software script (crude, albeit effective) for figuring out some golden ratios with regard to the widths of the boards that would constitute the desk’s main top surface.

The disassembly, and reassembly of the desk was interesting to me because it allowed me to better inspect the joints and such, as well as replace the drawer slides on the center drawer.  When we moved to a different house in 2012, and the desk was disassembled, the original drawer slides on the center drawer broke, and the replacement just never really quite worked well, and it did not extend far enough to make the drawer fully useful.

The design and construction of the desk was a bit of a rolling effort.  I would design and draft up plans for a side panel or a drawer front, and my friend and I would spend a Tuesday evening jointing, planing and sawing the pieces of wood that would be necessary for that piece.

Shellac to Alcohol Ratios

I spent a lot of time tinkering with AutoCAD.  It was really quite enjoyable, and it allowed me to use some of the drafting skills I had learned while in high school.  During high school, the thinking was that future career plans would be some sort of mechanical or civil engineering, and drafting might be useful.  Education and career track ultimately did not follow the physical engineerings, but wandered down the path of computer science and the engineering of software, but I still feel that all the drafting and CAD I took in high school was well worth the time and effort.

In addition to picking up a legitimate copy of AutoCAD (I was a student at the time, so, I took advantage of AutoDesk’s educational discount program), I picked up a wide body inkjet printer.  This made working with the plans in the shop more readable.

The desk was designed to be disassembled from time to time.  The center drawer, with the correct slides, is removable; the desk top can be removed after removing the bolts that hold it to angle iron (see photos below) on the inside edge of the top of the drawer assemblies; the front (opposite where you sit) is removable by unscrewing four brass wood screws.  All of the drawers in each side can be removed to lighten the weight; if you are curious, I used Blum full extension slides.  A little bit more about the materials and supplies I used: the finish is 4 coats of shellac with several coats of marine grade varnish over the shellac.  Twelve years out from the finishing coats, there are no signs of sun damage to the finish.  The wood, cherry and walnut, came from a friend and his family.  He has appeared in many blog posts over the years, from showing up in photos of gardening, to snowshoeing into a Minnesota state park, to the two of us traveling to arctic Canada, to me chronicling a cross-country road trip to his wedding.  Alas, the supply of cherry, walnut, oak, and others dried up when his parents left Minnesota.  Much of the other wood that was used in the desk, like luan plywood and such, came from local big-box lumber yards.  All of the drawers are also lined with physical stock certificates.  There are certificates for Marquette Cement, Massey-Ferguson Limited, Chemsol Incorporated, as well as dozens of others.  All of these certificates were purchased off of eBay.

Even though the finish on the desk is holding up quite well, the top has had a small bit of damage.  As the wood has continued to dry out, a lengthy crack has appeared in the top.  It is, however, in a location that does not impact functionality.  Aside from the crack, there was some shrinkage that was causing several of the drawers to no longer be aligned quite right.  In order for a drawer to be fully closed, it had to be lifted up slightly.  All of these drawer issues were resolved as I reassembled the desk in its new location.

Finally, if you are curious about the plans and possibly making your own fancy, overly complicated desk, the plans (most in PDF, but others in AutoCAD’s DWG format) are linked below.  The plans are released under a BSD 3-Clause-like license.

The little bit of clunky software is also linked below; instructions on how to run the perl script are at the very bottom.


File: desk-plans.zip (5MB)

File: table_layout.pl_.zip (4KB)




table_layout.pl is a simple script that can calculate various
options for construction of a table-top.  It assumes that you want a
wider center board with narrow, even-counted boards on each side of the
center board.

Usage:  ./table_layout.pl --width=FLOAT [--widecnt=INT] [--optimal]


./table_layout.pl --width=30.75 --widecnt=5

For a table with a width of 30-3/4" with 5 of the wider center/edge pieces.
The third option, 'opt', will cause table_layout to try to order the solutions
in what it thinks is optimal - this feature is as of yet unimplemented.