
JUNTO Practice: Data Analysis, Chicago Buildings (part 2)

Discussed on September 22, 2020.

Datasets:

Continue your analysis from the previous meeting. Expand on your original work or find new problems to work on.

Solutions


Oscar Martinez

JUNTO Chicago Permit Data 2

Do Word Embeddings Contain More Info Than One-Hot Encodings?

Last time, we discovered that even a sparse one-hot representation of the text in the WORK_DESCRIPTION column let us do a better job of estimating the reported cost. This week, we continue that work by trying to predict the reported cost using word embeddings instead.

Word Embeddings

Word embeddings are dense vector representations of words. These representations encode relationships between words as they appear in the text corpus used to train/create the embeddings.
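For intuition, here is a minimal sketch (using the same pre-trained en_core_web_lg model we load below) showing that related words should receive similar vectors:

import en_core_web_lg

nlp = en_core_web_lg.load()
porch, deck, sign = nlp("porch"), nlp("deck"), nlp("sign")

# Cosine similarity of the vectors: related construction terms
# should score higher than less related ones.
print(porch.similarity(deck))
print(porch.similarity(sign))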

To start, we will attempt to use pre-trained embeddings from the spaCy library.

import pandas as pd
import spacy
import en_core_web_lg

nlp = en_core_web_lg.load()

# Run only the tokenizer/vectors; tagging, parsing, and NER are
# disabled for speed since we only need document vectors.
processed_words = list(
    nlp.pipe(
        modeling_permits["WORK_DESCRIPTION"],
        disable=["tagger", "parser", "ner"],
        batch_size=50,
    )
)

# One vector per description (spaCy's Doc.vector).
word_vectors = pd.DataFrame([doc.vector for doc in processed_words])

This gives a 300-dimensional vector for each work description (spaCy averages the document's token vectors). We concatenate these to our original inputs, replacing the one-hot encoded work description.
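A rough sketch of that concatenation step (the exact feature handling is an assumption; it drops the raw text and the log10_fees target, keeps the remaining numeric columns, and appends the embedding columns):

import pandas as pd

# word_vectors: one 300-dimensional row per permit, from the block above.
other_features = modeling_permits.drop(
    columns=["WORK_DESCRIPTION", "log10_fees"]
).reset_index(drop=True)
X = pd.concat([other_features, word_vectors], axis=1)
X.columns = X.columns.astype(str)  # avoid mixed str/int column labels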

Again we train a support vector regressor to predict the log-transformed reported cost, this time from our word vectors. We no longer need the text-processing pipeline, because the inputs are already numeric.

from sklearn import svm

reg = svm.LinearSVR(max_iter=3000)
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
0.650037878656426

Evaluating Results

Unfortunately, this R-squared of 0.65 is worse than the 0.72 we obtained with one-hot encodings.

There are a number of potential problems with our naive use of spaCy's embeddings.

To improve on these, in the future, we might want to:

How Well Can We Predict Permits From Word Description Alone?

Previously we focused on reported cost. Unfortunately, there is reason to believe the cost reported on permit applications is not a useful measure of truth because of the incentives at play for the applicant.

We now turn our attention to the actual permit fees and type of permit applied for.

We want to test:

Predicting Permit Type

We begin by trying to categorize the type of permit using one-hot encoded vectors to represent the work description. The pipeline for the data is similar to before except we use a support vector classifier instead of a regressor, since the task has changed.

from sklearn import svm
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ("text", ColumnTransformer(
        transformers=[(
            "WORK_DESCRIPTION",
            Pipeline([
                ("vect", CountVectorizer()),
                ("tfidf", TfidfTransformer()),
            ]),
            "WORK_DESCRIPTION",
        )],
        remainder="passthrough",
    )),
    ("clf", svm.LinearSVC(max_iter=5000)),
])

text_clf.fit(X_train, y_train)
text_clf.score(X_test, y_test)
0.9640369221893704

We are able to predict the permit type with 96.4% accuracy on the testing data. This suggests the simple choice of words is predictive of the type of permit being applied for.

Problems

While the 96.4% is encouraging, it is also suspiciously high. Unless we know more about how the WORK_DESCRIPTION column is generated, and how it might encode form-specific information from the different PERMIT_TYPE forms, we cannot be sure we could predict the requisite permits from a description alone. For example, the work description for every elevator permit might include a specific set of answers copied from a checkbox on the elevator permit form. Our model could then key on that boilerplate rather than on the presence of the word "elevator".
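One way to probe this concern is to inspect which tokens carry the most weight for each class. A sketch, assuming the fitted text_clf pipeline from above and a recent scikit-learn (with get_feature_names_out):

import numpy as np

column_tf = text_clf.named_steps["text"]
vect = column_tf.named_transformers_["WORK_DESCRIPTION"].named_steps["vect"]
vocab = np.array(vect.get_feature_names_out())

clf = text_clf.named_steps["clf"]
for label, coefs in zip(clf.classes_, clf.coef_):
    # Highest-weight text tokens for this permit type; form boilerplate
    # showing up here would support the overfitting concern.
    top = vocab[np.argsort(coefs[: len(vocab)])[-5:]]
    print(label, list(top))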

Further work should:

BONUS ROUND and Scratch Work

Predicting Total Permit Fees

Given our model's high performance classifying permits from their descriptions, it may do better at predicting permit fees than reported costs. In particular, permit fees may bear a stronger relationship to the description, because they are generated by a review of an application that includes the work description.

I completed some preliminary work on this using a methodology similar to last week's (log-transformed TOTAL_FEE and one-hot encodings for WORK_DESCRIPTION) and ultimately attained a root mean squared error that corresponds to a factor-of-2 deviation from the truth, on average.
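The factor-of-2 reading comes from back-transforming the log10-scale error; a small self-contained illustration:

import numpy as np

# Hypothetical targets and predictions, each off by exactly 2x.
y_true = np.log10([100.0, 500.0, 2000.0])
y_pred = np.log10([200.0, 250.0, 4000.0])

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)        # log10(2) ~= 0.301
print(10 ** rmse)  # ~2.0: the typical multiplicative error factor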

More research would be needed to evaluate how useful this sort of prediction is compared to the calculator available online.

While I didn't have time to write up my analysis, some scratch work is included below. For each PERMIT_TYPE, I report the mean and standard deviation of TOTAL_FEE for that permit group, as well as the 10th, 50th, 90th, and 99th percentiles of TOTAL_FEE. I then randomly sample 5 permits from each group and report the predicted cost and the true final cost in original (non-log-transformed) terms, followed by the sampled work descriptions.
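For reference, the report below was generated by code along these lines (a rough sketch; names like reg and X are assumptions for the fitted fee model and its feature matrix):

for permit_type, group in modeling_permits.groupby("PERMIT_TYPE"):
    print("===NEW GROUP===")
    print(permit_type)
    print(group["TOTAL_FEE"].mean())  # group mean
    print(group["TOTAL_FEE"].std())   # group standard deviation
    print(group["TOTAL_FEE"].quantile([0.10, 0.50, 0.90, 0.99]))

    sample = group.sample(5)
    print(10 ** reg.predict(X.loc[sample.index]))  # predicted fees
    print(sample[["TOTAL_FEE"]])                   # true fees
    for i, desc in enumerate(sample["WORK_DESCRIPTION"]):
        print(i, ":", desc)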

'''
===NEW GROUP===
PERMIT - EASY PERMIT PROCESS
337.2890527846054
6109.386357665037
0.10      70.0
0.50     250.0
0.90     550.0
0.99    1375.0
Name: TOTAL_FEE, dtype: float64
[310.59043526 191.92247361 349.86647242 169.84179651 449.11951939]
        log10_fees
584230       550.0
430185       200.0
632897       550.0
344317       100.0
7081         575.0
0 : REVISION TO PERMIT NO. 100763465; CHANGE OF HVAC TO SUN MECHANICAL SYSTEMS
1 : CHANGE HVAC CONTRACTOR TO K&F COOLING FOR PERMIT 100282299
2 : REPLACE THE EXISTING FIVE TON RTU WITH ONE CARRIER 5 TON RTU, SELF CONTAINED, AIR COOLED, R410, LOCATED ON THE ROOF. THE RTU SERVES THE SECOND FLOOR.
3 : REPLACE 68 WINDOWS.  ALL WORK SAME AS EXISTING.
4 : REPLACE A FEW BRICKS / SPOT TUCKPOINT

===NEW GROUP===
PERMIT - ELECTRIC WIRING
99.12099523896315
465.1077306641869
0.10     40.0
0.50     75.0
0.90    150.0
0.99    675.0
Name: TOTAL_FEE, dtype: float64
[219.03726265  53.16560261 106.1888374  111.86819569 103.11903806]
        log10_fees
46266        100.0
15119         50.0
620138        75.0
68417        215.0
579428        75.0
0 : temp 400 amp service
1 : INSTALL 1" OR 2" EMPTY CONDUIT FOR AT&T FIBER OPTICS.
2 : LOW VOLTAGE ALARM
3 : ELECTRICAL INSTALLATIONS AND REVISIONS OF THE CART RETURN CORRIFOR TO ONE STOP EXSPRESS.PASSAGEWAY AT TERMINAL 5. WORK IS ASSOCIATED WITH BUILDING PERMIT NUMBER 100485355
4 : INSTALLATION OF LOW VOLTAGE BURGLAR ALARM

===NEW GROUP===
PERMIT - ELEVATOR EQUIPMENT
713.5431690821256
1688.7435975594692
0.10     100.0
0.50     305.0
0.90    1430.0
0.99    7263.0
Name: TOTAL_FEE, dtype: float64
[ 278.69147449   99.88069882  405.02060757  704.88019834 1193.64505838]
        log10_fees
23090        270.0
217481       100.0
38568        270.0
622595       425.0
440097      1275.0
0 : Installation Of One (10 Harmar CPL 600 VPL With Safety Landing Gate At Rear Deck Of Residence 750lb.cap, 10 FPM, With ADA Hands Free Emergency Phone. Pursuant To The Plans Sumbitted.
1 : Install One (1) 750lb.cap. Vertical Wheelchair Lift on Rear of Home ADA Phone Required. Pursuant to Plans Submitted.
2 : INSTALL ONE HYDRAULIC PASSENGER ELEVATOR AS PER PLANS
3 : FURNISH AND INSTALL ONE VERTICAL WHEELCHAIR LIFT
4 : INSTALL NEW DOOR RESTRICTORS,NEW LIGHT CIRCUI, SAFETY TESTS

===NEW GROUP===
PERMIT - FOR EXTENSION OF PMT
1582.9296551724137
1982.6095143265418
0.10      73.5000
0.50     823.3600
0.90    4587.5250
0.99    7051.2025
Name: TOTAL_FEE, dtype: float64
[ 174.77882012  494.55939623  164.93942334 1990.50896332  809.10091406]
        log10_fees
23544         0.25
86389      6658.38
116014      122.50
89821      1109.11
145004       92.25
0 : REINSTATE TENT PERMIT FOR THE SHAMROCK SHUFFLE 2006 FOR THE DATE OF 4-2-06
1 : reinstate permit 1048964 for a 60 x 180 tent for oyster with the oyster from 10-19-06 to 10-21-06
2 : FIRST EXTENSION OF PERMIT # 100107982 AT 25% ORIGINAL PERMIT FEES. NO CHANGE TO DESCRIPTION OF WORK : ERECT NEW 2 STORY 4A  SFR W/DETACHED GARAGE AS PER PLAN
3 : FIRST EXTENSION OF PERMIT # 100090817 ISSUED 02/07/06 AT 25% ORIGINAL PERMIT FEE. NO CHANGE TO DESCRIPTION OF WORK : INTERIOR ALTERATIONS TO UPGRADE ELECTRICAL MECHANICAL & PLUMBING AS PER PLANS
4 : FIRST EXTENSION OF PERMIT #100129086 AT 25% OF ORIGINAL COST NO CHANGE TO DESCRIPTION OF WORK : erect 2 story sfr with rear deck, roof deck, and detached 2 car frame garage as per plan.

===NEW GROUP===
PERMIT - NEW CONSTRUCTION
7244.593596940957
37581.57409388481
0.10       325.000
0.50      2150.000
0.90      9603.890
0.99    114347.364
Name: TOTAL_FEE, dtype: float64
[1149.9864995  1087.87652514 1476.99346442 2131.25936033 1344.82017411]
        log10_fees
102692     4743.75
567652     7984.04
58907      1817.75
373001     1625.00
436937     3590.00
0 : PERMIT EXPIRES ON 09/08/2015 ERECTION STARTS: 7/7/2015, ERECTION ENDS: 7/9/2015.  REINSTATEMENT OF PERMIT 100549306 TO ERECT 60' X 100' AND 20' X 60' TENTS FOR EVENT JULY 8, 2015
1 : TELECOMMUNICATION SITE WITH EQUIPMENT ON CONCRETE PAD WITH ANTENNAS ON NEW 100' MONOPOLE
2 : ERECT (1) 15' x 30', (1) 40' x 40', AND (1) 60' x 82' TENTS, FOR THE DUSABLE MUSEUM FOR EVENT ON APRIL 1, 2011. INSTALL 3/29/10 AND  REMOVE 4/4/11.
3 : ERECT A 2 STORY, IVA FRAME SINGLE FAMILY RESIDENCE WITH BASEMENT,  AND  DETACHED 3 CAR GARAGE WITH ROOF DECK PER PLANS
4 : PPROPOSED 2 STORY FRAME SINGLE FAMILY RESIDENCE WITH BASEMENT AND DETACHED 2 CAR FRAME GARAGE AS PER PLANS.

===NEW GROUP===
PERMIT - PORCH CONSTRUCTION
228.79934775589282
135.9753682451586
0.10    135.0
0.50    185.0
0.90    350.0
0.99    765.2
Name: TOTAL_FEE, dtype: float64
[446.47015399 348.60692786 116.8850952  215.41215804 245.86964149]
       log10_fees
27098      285.00
47074      210.00
25931      235.00
83239      176.25
32512      210.00
0 : REPAIR OF EXISTING (4) PORCHES LOCATED AT 5030-5036 SOUTH EVANS AVE. THE WORK INCLUDES ENHANCING OF FRAMING SYSTEM TO COMPLY W/CURRENT CITY OF CHICAGO STANDARDS.                                                                                                                               -CONDITIONAL PERMIT, SUBJECT TO FIELD VERIFICATION.
1 : REPLACE EXISTING 2 STORY REAR ENCLOSED PORCH WITH DOH PROTOTYPE.
2 : REBUILD BACK PORCH AS PER PLANS
3 : CONSTRUCTION OF A NEW SINGLE STORY WOOD DECK SAME AS EXISTING AS PER PLANS
4 : Replace rear open wood stair system in kind

===NEW GROUP===
PERMIT - REINSTATE REVOKED PMT
1046.3014823261117
3454.7208947492527
0.10       90.040
0.50      281.250
0.90     1939.890
0.99    12463.953
Name: TOTAL_FEE, dtype: float64
[282.02981197 117.21368541 112.54734808 175.64466113 142.10276903]
        log10_fees
424339      755.50
222026      150.00
90253       112.50
171604      362.50
369445      142.19
0 : REINSTATEMENT OF PERMIT #100340466 TO ERECT SPECIAL EVENT TENT (2) 40' x 80' FROM 7/4/2011 TO 9/4/2011 - FOR 60 DAYS ONLY                                                                                                  PERMIT EXPIRES ON 9/4/2011
1 : REINSTATEMENT OF PERMIT #100292530 TO ERECT SPECIAL EVENT TENT (1) 60' x 60' FROM 5/2/10 TO 07/2/2010 - FOR 60 DAYS ONLY                                                                                                  PERMIT EXPIRES ON 07/2/2010
2 : REINSTATE PERMIT 100191228 AT 50%. OWNER AND CONTRACTORS TO REMAIN THE SAME.
3 : REINSTATEMENT OF PERMIT #100340467 TO ERECT SPECIAL EVENT TENT (1) 60' x 60' FROM 7/4/2011 TO 9/4/2011 - FOR 60 DAYS ONLY                                                                                                  PERMIT EXPIRES ON 9/4/2011
4 : REINSTATEMENT OF PERMIT #100340463 TO ERECT SPECIAL EVENT TENT (10) 20' x 20' FROM 7/4/11 THROUGH 09/4/11  - FOR 60 DAYS ONLY                                                                                                 PERMIT EXPIRES ON 09/4/2011

===NEW GROUP===
PERMIT - RENOVATION/ALTERATION
2398.085947746459
19770.884603309925
0.10      200.0000
0.50      800.0000
0.90     3825.0000
0.99    28229.8246
Name: TOTAL_FEE, dtype: float64
[ 806.56658129  624.01608924 1514.58805649  589.4067932  1085.86985815]
        log10_fees
253257     1575.00
217008      375.00
568158      892.49
311639    14825.00
474200     2825.36
0 : PORCH REPAIRS
1 : ADDITION OF STAIR TO CONNECT UNITS 27A WITH 28A, BATHROOMS AND KITCHEN REMODEL.
2 : RECONFIGURE INTERIOR STAIRWELL, RECONFIGURE EXISTING BATHROOMS, AND RELOCATE DOOR FOR EXISTING BARBER SHOP AS PER PLANS. NO HVAC & PLUMBING WORK UNDER THIS PERMIT.
3 : SIDE STAIR REPLACEMENT TO CORRECT VIOLATION PER PLANS
4 : REMOVE AND REPLACE 6 EXISTING ANTENNAS WITH NEW TECHNOLOGY AND ADD 1 NEW EQUIPMENT CABINET TO AND EXITING WIRELESS COMMUNICATIONS FACILITY AS PER PLANS.  SITE #CH60XC129

===NEW GROUP===
PERMIT - SCAFFOLDING
184.20639979602242
197.83836836490553
0.10     50.0
0.50    100.0
0.90    400.0
0.99    900.0
Name: TOTAL_FEE, dtype: float64
[86.43206368 74.20002061 55.85251088 55.40651441 35.53221446]
        log10_fees
372367       100.0
277412       400.0
118612        50.0
176284       200.0
355462       600.0
0 : ERECT ONE WERNER SWING STAGE SCAFFOLD FROM 09/28/2010 - 10/28/2010
1 : ERECT 1 SCAFFOLD FROM 4/6/12 TO 4/6/13
2 : ERECT SIX SKYCLIMBER SCAFFOLDS FROM 10/01/2008 - 10/01-2009, NO OTHER REPAIRS ON THIS PERMIT.
3 : ERECT 2 SCAFFOLDS FROM 11/3/14 TO 11/3/15
4 : ERECT 1 SCAFFOLD FROM 9/16/14 TO 9/16/15

===NEW GROUP===
PERMIT - SIGNS
291.8125620140688
206.67145044804147
0.10     119.5
0.50     250.0
0.90     400.0
0.99    1200.0
Name: TOTAL_FEE, dtype: float64
[288.45938182 217.15004343 306.82055962 244.10472127 239.18584651]
        log10_fees
567939       300.0
207707       250.0
482102       250.0
541817       250.0
232278       250.0
0 : NON-ILLUMINATED ALUMINUM BLADE SIGN WITH DOUBLE-SIDED VINYL GRAPHICS
1 : NON-ILLUMINATED ALUMINUM BLADE SIGN WITH DOUBLE-SIDED VINYL GRAPHICS
2 : INSTALLATION OF (1) NON ILLUMINATED AWING TO BE MOUNTED TO THE EAST BUILDING ELEVATION.   AWNING TO READ ELIMINATING RACISM EMPOWERING WOMEN YWCA METROPOLITAN CHICAGO
3 : REMODEL THE EXISTING PYLON SIGN TO REMOVE THE TWO TENANTS NO LONGER IN THE SHOPPING CENTER AND INSTALLATION OF (1) NEW EMC TO BE MOUNTED IN THE EXISTING SIGN CABINET.
4 : INSTALLATION OF VINYL FOR SUITE B19

===NEW GROUP===
PERMIT - WRECKING/DEMOLITION
388.13682893617016
709.8742387699102
0.10     110.00
0.50     300.00
0.90     500.00
0.99    1692.92
Name: TOTAL_FEE, dtype: float64
[163.62839374  91.80030635 254.51155982 245.9460905  224.33865934]
        log10_fees
90117        110.0
317169       300.0
407269       300.0
530475       500.0
292410       300.0
0 : WRECK & REMOVE 2 STORY BRICK RESIDENTIAL STRUCTURE
1 : WRECK & REMOVE 1 STORY FRAME STRUCTURE & TRAILER
2 : DEMOLITION AND REMOVAL OF A 2 STORY FRAME RES BUILDING.
3 : WRECK & REMOVE EXISTING CONCRETE PLANT.  PLANT C. GARAGE.
4 : WRECK AND REMOVE A 1 STORY STEEL FRAMED MASONRY BUILDING.
'''

John Lekberg

In the City of Chicago, does the distribution of permit processing times change year over year?

In the City of Chicago, most construction work requires a building permit. In my previous research, I created a list of rules that described the distribution of processing times for a given year, with the goal of being able to predict the processing times for future years. But I made an untested assumption that the distribution of processing times remained the same, year over year. The goal of this research is to test that assumption. I find that the distribution of processing times changes year over year for most permit types. This means that the rule list previously created may be useless. Future research should quantify the change in distributions to assess the practical implications for the rule list.

Introduction

In Chicago, most construction work requires a building permit. If I need a permit, I want to know:

  1. How much will it cost? (Cost question.)
  2. How long will it take to be processed? (Time question.)

The City of Chicago answers the cost question by providing a calculator on their website. ("Calculate the Cost of a Building Permit", n.d.)

I couldn't find an answer to the time question, so I created a list of rules that describe the distribution of processing time based on permit type:

(Based on 2019 data.)

However, when I built this rule list, I assumed that the distribution of processing times stayed the same year over year. If this assumption doesn't hold, then this rule list may not be useful, because describing the previous year's permit processing may not help you predict the next year's.

The goal of this research is to test my assumption that the distribution of processing times stays the same, year over year.

Methods

The City of Chicago provides a dataset of building permits ("Building Permits", n.d.). I downloaded the dataset as a comma-separated value (CSV) file and imported it into a SQLite (Hipp et al. "SQLite") database, permit.db:

$ sqlite3 permit.db
> .mode csv
> .import Building_Permits.csv Permit_Raw

I identified 4 columns of interest in Permit_Raw:

permit#: Tracking number assigned at the beginning of the permit application process.
permit_type: Type of permit.
processing_time: Number of days between application_start_date and issue_date.
issue_date: Date when the City determined the permit was ready to issue, subject to payment of permit fees.

(Column descriptions are from "Building Permits" [n.d.].)

Based on these columns, I created a new table, Permit_Clean, that is a cleaned up version of Permit_Raw. Here's the schema:

CREATE TABLE Permit_Clean (
    id_pmt            TEXT     NOT NULL,
    cat_pmt_type      TEXT     NOT NULL,
    amt_pmt_proc_day  INTEGER  NOT NULL,
    dt_pmt_issue      TEXT     NOT NULL,
    
    PRIMARY KEY (id_pmt),

    CONSTRAINT 'cat_pmt_type non-empty'
        CHECK (cat_pmt_type <> ''),

    CONSTRAINT 'amt_pmt_proc_day non-negative'
        CHECK (amt_pmt_proc_day >= 0),

    CONSTRAINT 'dt_pmt_issue YYYY-MM-DD'
        CHECK (dt_pmt_issue LIKE '____-__-__')
);

To populate Permit_Clean, I took all data from Permit_Raw except for rows with a negative processing time. Here's the SQL code that does this:

INSERT INTO Permit_Clean
  (id_pmt, cat_pmt_type, amt_pmt_proc_day, dt_pmt_issue)
SELECT "permit#"
       , permit_type
       , CAST(processing_time AS INTEGER)
       , mdy_to_ymd(issue_date)
  FROM Permit_Raw
 WHERE CAST(processing_time AS INTEGER) >= 0
;

mdy_to_ymd is an application-defined function (Hipp et al. "Application-Defined SQL Functions") that I created to convert dates from MM/DD/YYYY format to YYYY-MM-DD format.
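SQLite itself doesn't know this function; in Python's sqlite3 module it could be registered along these lines (a sketch of one possible implementation):

import sqlite3

def mdy_to_ymd(mdy):
    """Convert 'MM/DD/YYYY' to 'YYYY-MM-DD' (one possible implementation)."""
    m, d, y = mdy.split("/")
    return f"{y}-{m.zfill(2)}-{d.zfill(2)}"

db = sqlite3.connect("permit.db")
db.create_function("mdy_to_ymd", 1, mdy_to_ymd)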

I want to test the hypothesis "the distribution of processing time is the same each year."

I test this hypothesis for the years 2017, 2018, and 2019 by using data from 2016 to 2019.

To test this hypothesis, I use the Kolmogorov-Smirnov test ("Kolmogorov-Smirnov test", n.d.) with a significance level α = 0.10 to compare data from one year to the next. The null hypothesis of this test is that both sets of data are drawn from the same continuous distribution.

I conduct the Kolmogorov-Smirnov tests using the R programming language (R Core Team, 2018).
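For readers without R, the same two-sample test can be sketched in Python with SciPy (the data here is hypothetical):

from scipy.stats import ks_2samp

# Hypothetical processing times (days) for two adjacent years.
times_prior = [1, 2, 3, 5, 7, 14, 30, 60]
times_year = [1, 2, 4, 6, 9, 15, 45, 90]

result = ks_2samp(times_year, times_prior)
print(result.statistic, result.pvalue)
# Reject the null hypothesis when the p-value is below alpha = 0.10.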

Results

Here are the results of the Kolmogorov-Smirnov test for the years 2017, 2018, and 2019 (data from 2016 to 2019):

An asterisk (*) marks permit types for which the null hypothesis is rejected (p < α = 0.10).

Table: Kolmogorov-Smirnov test. Year 2019 (prior year 2018).

  Permit                    N (year)   N (prior)   D        p
  Easy Permit Process *       14,190      13,151   0.0430   2.16e-11
  Electric Wiring *           15,804      15,794   0.0166   2.57e-02
  Elevator Equipment           1,684       1,583   0.0310   4.15e-01
  New Construction             1,633       1,810   0.0381   1.65e-01
  Reinstate Revoked Pmt          324         185   0.0247   1.00e+00
  Renovation/Alteration *      9,481       9,741   0.0538   1.68e-12
  Scaffolding                    441         419   0.0061   1.00e+00
  Signs *                      4,035       3,175   0.1124   0.00e+00
  Wrecking/Demolition *        1,084       1,147   0.0525   9.25e-02

Table: Kolmogorov-Smirnov test. Year 2018 (prior year 2017).

  Permit                    N (year)   N (prior)   D        p
  Easy Permit Process         13,151      12,549   0.0129   2.37e-01
  Electric Wiring             15,794      17,190   0.0125   1.55e-01
  Elevator Equipment           1,583       1,424   0.0414   1.52e-01
  New Construction *           1,810       1,856   0.0512   1.63e-02
  Reinstate Revoked Pmt          185         189   0.0221   1.00e+00
  Renovation/Alteration *      9,741       9,653   0.1236   0.00e+00
  Scaffolding                    419         459   0.0136   1.00e+00
  Signs *                      3,175       2,806   0.0580   8.87e-05
  Wrecking/Demolition          1,147       1,214   0.0348   4.72e-01

Table: Kolmogorov-Smirnov test. Year 2017 (prior year 2016).

  Permit                    N (year)   N (prior)   D        p
  Easy Permit Process *       12,549      12,073   0.0261   4.65e-04
  Electric Wiring *           17,190      16,750   0.0165   1.95e-02
  Elevator Equipment *         1,424       1,187   0.0623   1.32e-02
  New Construction *           1,856       1,875   0.0435   5.82e-02
  Reinstate Revoked Pmt          189         220   0.0211   1.00e+00
  Renovation/Alteration *      9,653       9,414   0.0685   0.00e+00
  Scaffolding                    459         461   0.0064   1.00e+00
  Signs *                      2,806       2,707   0.0490   2.64e-03
  Wrecking/Demolition *        1,214       1,312   0.0643   1.08e-02

Discussion

Based on the results of the Kolmogorov-Smirnov tests, I reject the null hypothesis for most of the permit types in at least one of the years 2017, 2018, and 2019. Only the permit types Reinstate Revoked Pmt and Scaffolding were not rejected in any year.

This means that the rule list that I developed in my previous work may be an inaccurate tool, because simply describing the previous year's distribution of processing times is insufficient to predict the next year's distribution.

The next step in this research will be to quantify the difference in the distributions. This is important because if the distributions are technically different but practically the same, then the rule list developed in my previous research may still have some use.
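One such practical comparison could look at how much key percentiles shift year over year; a sketch with hypothetical data:

import numpy as np

# Hypothetical processing times (days) for two adjacent years.
times_2018 = np.array([1, 2, 3, 5, 7, 14, 30, 60])
times_2019 = np.array([1, 2, 4, 6, 9, 15, 45, 90])

# Small shifts in these percentiles would mean the rule list is still
# practically useful despite the Kolmogorov-Smirnov result.
for q in (0.10, 0.50, 0.90):
    shift = np.quantile(times_2019, q) - np.quantile(times_2018, q)
    print(f"{q:.0%} percentile shift: {shift:+.1f} days")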

References

Daniel Bassett

I built further on my neural network, asking: with the given data, can we estimate functions for processing time and likelihood of violation?

This is still an early version of the network. Future versions will include more parameters and optimized layers.

import pandas as pd
import numpy as np
import torch
from matplotlib import pyplot as plt

permits = pd.read_csv("Building_Permits.csv", thousands=",")

"""    converters = {"SUBTOTAL_PAID":float,
                 "SUBTOTAL_UNPAID":float,
                 "REPORTED_COST":float,
                 "TOTAL_FEE":float,
                 "PROCESSING_TIME":float}"""
'    converters = {"SUBTOTAL_PAID":float,\n                 "SUBTOTAL_UNPAID":float,\n                 "REPORTED_COST":float,\n                 "TOTAL_FEE":float,\n                 "PROCESSING_TIME":float}'
print(permits.columns.values)
print(permits["SUBTOTAL_PAID"].dtype)
['ID' 'PERMIT#' 'PERMIT_TYPE' 'REVIEW_TYPE' 'APPLICATION_START_DATE'
 'ISSUE_DATE' 'PROCESSING_TIME' 'STREET_NUMBER' 'STREET DIRECTION'
 'STREET_NAME' 'SUFFIX' 'WORK_DESCRIPTION' 'BUILDING_FEE_PAID'
 'ZONING_FEE_PAID' 'OTHER_FEE_PAID' 'SUBTOTAL_PAID' 'BUILDING_FEE_UNPAID'
 'ZONING_FEE_UNPAID' 'OTHER_FEE_UNPAID' 'SUBTOTAL_UNPAID'
 'BUILDING_FEE_WAIVED' 'ZONING_FEE_WAIVED' 'OTHER_FEE_WAIVED'
 'SUBTOTAL_WAIVED' 'TOTAL_FEE' 'CONTACT_1_TYPE' 'CONTACT_1_NAME'
 'CONTACT_1_CITY' 'CONTACT_1_STATE' 'REPORTED_COST' 'COMMUNITY_AREA'
 'WARD' 'XCOORDINATE' 'YCOORDINATE' 'VIOLATION_ID']
float64

# Drop rows with any missing values.
clean = permits.dropna(axis=0)
subtotal_paid = clean["SUBTOTAL_PAID"]
total_fee = clean["TOTAL_FEE"]
processing_time = clean["PROCESSING_TIME"]
violation_id = clean["VIOLATION_ID"]
reported_cost = clean["REPORTED_COST"]

print((subtotal_paid - total_fee).mean())
-114.41881703176709
subtotal_paid.dtype
violation_id.dtype

fee_ratio = subtotal_paid / total_fee

print(fee_ratio)
5         1.0
10        1.0
12        1.0
13        1.0
16        1.0
         ... 
636722    0.0
636723    1.0
636724    1.0
636725    1.0
636726    0.0
Length: 521637, dtype: float64
permits_numpy = np.array(
    [
        fee_ratio,
        subtotal_paid,
        total_fee,
        reported_cost,
        processing_time,
        violation_id,
    ],
    dtype=np.float32,
)
permits_numpy.shape, permits_numpy.dtype, permits_numpy
((6, 521637),
 dtype('float32'),
 array([[1.000000e+00, 1.000000e+00, 1.000000e+00, ..., 1.000000e+00,
         1.000000e+00, 0.000000e+00],
        [5.000000e+01, 5.000000e+01, 7.500000e+01, ..., 2.250000e+02,
         5.500000e+02, 0.000000e+00],
        [5.000000e+01, 5.000000e+01, 7.500000e+01, ..., 2.250000e+02,
         5.500000e+02, 1.500000e+02],
        [1.000000e+00, 7.500000e+02, 1.000000e+03, ..., 6.472000e+03,
         1.753200e+04, 7.550000e+03],
        [1.600000e+01, 0.000000e+00, 0.000000e+00, ..., 0.000000e+00,
         0.000000e+00, 0.000000e+00],
        [2.604962e+06, 5.925308e+06, 2.621932e+06, ..., 3.162157e+06,
         3.162158e+06, 3.162159e+06]], dtype=float32))
permits_torch = torch.from_numpy(permits_numpy)
permits_torch.shape, permits_torch.dtype
(torch.Size([6, 521637]), torch.float32)
fee_ratio_actual = np.where(
    np.logical_and(fee_ratio >= 0, fee_ratio <= 1)
)

processing_time_actual = np.where(
    np.less_equal(processing_time, 5000)
)

fee_ratio.mean(), fee_ratio.var()
(0.9685662035380701, 0.027418734640841984)
# Now that we have the data organized and our variables isolated,
# let's visualize the data

fig = plt.figure(dpi=600)
plt.xlabel("Paid Fee Ratio")
plt.ylabel("Processing Time")
plt.plot(permits_numpy[0], permits_numpy[4], "bo")

print(max(processing_time))
4930.0
print(
    "It's not clear from this visualization. Let's throw in more variables and build our Pytorch neural network."
)

import torch.optim as optim
import torch.nn as nn
from sklearn import preprocessing
It's not clear from this visualization. Let's throw in more variables and build our Pytorch neural network.
frt = permits_torch[2]  # total-fee tensor (row 2 of permits_numpy is TOTAL_FEE)
rct = permits_torch[3]  # reported-cost tensor
ptt = permits_torch[4]  # processing-time tensor

frt = frt.unsqueeze(1)  # reshape to (n_samples, 1) for nn.Linear
ptt = ptt.unsqueeze(1)

ptt.shape
torch.Size([521637, 1])
n_samples = ptt.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices
(tensor([390992, 455982, 478208,  ..., 228113,  40310, 300034]),
 tensor([135076, 107839,  12386,  ..., 344239, 229882,  73529]))
ptt_train = ptt[train_indices]
frt_train = frt[train_indices]

ptt_value = ptt[val_indices]
frt_value = frt[val_indices]

# Scale the inputs by 0.1 to keep values in a trainable range.
ptt_untrain = 0.1 * ptt_train
ptt_unval = 0.1 * ptt_value

linear_model = nn.Linear(1, 1)
linear_model(ptt_unval)
tensor([[ 1.4346],
        [-0.3379],
        [ 1.2130],
        ...,
        [-0.3379],
        [-0.3379],
        [-0.1902]], grad_fn=<AddmmBackward>)
linear_model.weight
Parameter containing:
tensor([[0.7386]], requires_grad=True)
linear_model.bias
Parameter containing:
tensor([-0.3379], requires_grad=True)
x = torch.ones(1)
linear_model(x)
tensor([0.4006], grad_fn=<AddBackward0>)
x = torch.ones(10, 1)
linear_model(x)
tensor([[0.4006],
        [0.4006],
        [0.4006],
        [0.4006],
        [0.4006],
        [0.4006],
        [0.4006],
        [0.4006],
        [0.4006],
        [0.4006]], grad_fn=<AddmmBackward>)
linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)
linear_model.parameters()
<generator object Module.parameters at 0x0000022712B7DBA0>
list(linear_model.parameters())
[Parameter containing:
 tensor([[-0.7984]], requires_grad=True),
 Parameter containing:
 tensor([-0.0445], requires_grad=True)]
def training_loop(n_epochs, optimizer, model, loss_fn, ptt_train, ptt_value,
                  frt_train, frt_value):
    for epoch in range(1, n_epochs + 1):
        pt_train = model(ptt_train)  # forward pass on training inputs
        loss_train = loss_fn(pt_train, frt_train)

        pt_val = model(ptt_value)  # forward pass on validation inputs
        loss_val = loss_fn(pt_val, frt_value)

        optimizer.zero_grad()
        loss_train.backward()  # backpropagate the training loss only
        optimizer.step()

        if epoch == 1 or epoch % 1000 == 0:
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}")

def loss_fn(pt, frt):
    squared_diffs = (pt - frt)**2
    return squared_diffs.mean()

linear_model = nn.Linear(1, 1)  # fresh, untrained model
optimizer = optim.Adam(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs = 3000, 
    optimizer = optimizer,
    model = linear_model,
    loss_fn = loss_fn,
    ptt_train = ptt_untrain,
    ptt_value = ptt_unval, 
    frt_train = frt_train,
    frt_value = frt_value)

print()
print(linear_model.weight)
print(linear_model.bias)
Epoch 1, Training loss 78878864.0000, Validation loss 59914564.0000
Epoch 1000, Training loss 78812888.0000, Validation loss 59848468.0000
Epoch 2000, Training loss 78761344.0000, Validation loss 59796900.0000
Epoch 3000, Training loss 78721384.0000, Validation loss 59756920.0000

Parameter containing:
tensor([[25.6018]], requires_grad=True)
Parameter containing:
tensor([28.6782], requires_grad=True)
linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs=3000,
    optimizer=optimizer,
    model=linear_model,
    loss_fn=nn.MSELoss(),  # use PyTorch's built-in MSE loss this time
    ptt_train=ptt_untrain,
    ptt_value=ptt_unval,
    frt_train=frt_train,
    frt_value=frt_value,
)

print()
print(linear_model.weight)
print(linear_model.bias)
Epoch 1, Training loss 78870432.0000, Validation loss 59906128.0000
Epoch 1000, Training loss 78098944.0000, Validation loss 59167068.0000
Epoch 2000, Training loss 78098944.0000, Validation loss 59167068.0000
Epoch 3000, Training loss 78098944.0000, Validation loss 59167068.0000

Parameter containing:
tensor([[23.6578]], requires_grad=True)
Parameter containing:
tensor([821.0441], requires_grad=True)
seq_model = nn.Sequential(
    nn.Linear(1, 13), nn.Tanh(), nn.Linear(13, 1)  # 1 input -> 13 hidden (tanh) -> 1 output
)
seq_model
Sequential(
  (0): Linear(in_features=1, out_features=13, bias=True)
  (1): Tanh()
  (2): Linear(in_features=13, out_features=1, bias=True)
)
[param.shape for param in seq_model.parameters()]
[torch.Size([13, 1]), torch.Size([13]), torch.Size([1, 13]), torch.Size([1])]
for name, param in seq_model.named_parameters():
    print(name, param.shape)
0.weight torch.Size([13, 1])
0.bias torch.Size([13])
2.weight torch.Size([1, 13])
2.bias torch.Size([1])
from collections import OrderedDict

seq_model = nn.Sequential(
    OrderedDict(
        [
            ("hidden_linear", nn.Linear(1, 8)),
            ("hidden_activation", nn.Tanh()),
            ("output_linear", nn.Linear(8, 1)),
        ]
    )
)

seq_model
Sequential(
  (hidden_linear): Linear(in_features=1, out_features=8, bias=True)
  (hidden_activation): Tanh()
  (output_linear): Linear(in_features=8, out_features=1, bias=True)
)
for name, param in seq_model.named_parameters():
    print(name, param.shape)
hidden_linear.weight torch.Size([8, 1])
hidden_linear.bias torch.Size([8])
output_linear.weight torch.Size([1, 8])
output_linear.bias torch.Size([1])
seq_model.output_linear.bias
Parameter containing:
tensor([-0.2399], requires_grad=True)
optimizer = optim.SGD(
    seq_model.parameters(), lr=1e-3  # smaller learning rate for the deeper model
)

training_loop(
    n_epochs=5000,
    optimizer=optimizer,
    model=seq_model,
    loss_fn=nn.MSELoss(),
    ptt_train=ptt_untrain,
    ptt_value=ptt_unval,
    frt_train=frt_train,
    frt_value=frt_value,
)

print("output", seq_model(ptt_unval))
print("answer", frt_value)
print("hidden", seq_model.hidden_linear.weight.grad)
Epoch 1, Training loss 77971072.0000, Validation loss 59061600.0000
Epoch 1000, Training loss 77970968.0000, Validation loss 59060816.0000
Epoch 2000, Training loss 77970632.0000, Validation loss 59060944.0000
Epoch 3000, Training loss 77970728.0000, Validation loss 59060132.0000
Epoch 4000, Training loss 77975160.0000, Validation loss 59055040.0000
Epoch 5000, Training loss 77970688.0000, Validation loss 59059828.0000
output tensor([[1004.1236],
        [ 654.4979],
        [1004.1236],
        ...,
        [ 654.4979],
        [ 654.4979],
        [ 654.5068]], grad_fn=<AddmmBackward>)
answer tensor([[175.],
        [ 50.],
        [240.],
        ...,
        [150.],
        [300.],
        [575.]])
hidden tensor([[ 1.4657e+02],
        [-5.6897e-02],
        [ 6.7035e-02],
        [ 4.1606e-01],
        [-4.1232e-02],
        [ 9.1959e-03],
        [-4.4917e-01],
        [ 4.8263e-01]])
from matplotlib import pyplot as plt

t_range = torch.arange(20.0, 90.0).unsqueeze(1)  # input grid for the model curve

fig = plt.figure(dpi=600)
plt.xlabel("Processing Time")
plt.ylabel("Total Fee")
plt.plot(ptt.numpy(), frt.numpy(), "o")
plt.plot(
    t_range.numpy(),
    seq_model(0.1 * t_range).detach().numpy(),
    "c-",
)
plt.plot(
    ptt.numpy(), seq_model(0.1 * ptt).detach().numpy(), "kx"
)

from sklearn.metrics import r2_score

Jupyter Notebook stalled before computing the r2 score. In summary, this small neural net did not find compelling relations between fees and processing time, probably due to the small number of parameters given the large dataset. Further iterations of the NN will include more parameters and optimized built-ins to obtain a more accurate prediction for processing time and violation probability.
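One possible workaround for the stall (a sketch using seq_model, ptt, and frt from above) is to score a random subsample instead of all ~520k rows:

import torch
from sklearn.metrics import r2_score

with torch.no_grad():
    idx = torch.randperm(ptt.shape[0])[:10_000]  # random 10k-row subsample
    preds = seq_model(0.1 * ptt[idx])            # same 0.1 input scaling as training
print(r2_score(frt[idx].numpy(), preds.numpy()))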