# JUNTO Practice: Data Analysis, Chicago Buildings (part 2)

Discussed on September 22, 2020.

Datasets:

**Continue your analysis from the previous meeting. Expand on your original work or find new problems to work on.**

## Solutions

### Oscar Martinez

# JUNTO Chicago Permit Data 2

## Do Word Embeddings Contain More Info Than One-Hot Encodings?

Last time, we discovered that even with the sparse representation of the text in the WORK_DESCRIPTION column, we could do a better job of estimating the reported cost. This week, we wanted to continue that work by trying to predict the reported cost using word embeddings.

### Word Embeddings

Word embeddings are dense vector representations of words. These representations encode relationships between words as they appear in the text corpus used to train/create the embeddings.

To start, we will attempt to use pre-trained embeddings from the spaCy library.

```
import spacy
import en_core_web_lg

nlp = en_core_web_lg.load()
processed_words = list(
    nlp.pipe(
        modeling_permits["WORK_DESCRIPTION"],
        disable=["tagger", "parser", "ner"],
        batch_size=50,
    )
)
word_vectors = [wv.vector for wv in processed_words]
word_vectors = pd.DataFrame(word_vectors)
```

This returns one 300-dimensional vector per work description. We concatenate these vectors to our original input, replacing the one-hot encoded work description.
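The column-wise join of embedding features with the remaining inputs can be sketched like this (hypothetical stand-in DataFrames, not the real permit columns):

```
import pandas as pd

# hypothetical stand-ins: two non-text features and a 3-dimensional "embedding"
other_features = pd.DataFrame({"log10_fees": [2.0, 2.4], "ward": [5, 12]})
word_vectors = pd.DataFrame([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

# align on position, then concatenate column-wise
X = pd.concat(
    [other_features.reset_index(drop=True), word_vectors.reset_index(drop=True)],
    axis=1,
)
print(X.shape)  # (2, 5)
```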

Again we train a support vector regressor to predict the log-transformed reported cost, this time from our word vectors. We don't need the text pipeline here because the inputs are already numerical vectors.

```
reg = svm.LinearSVR(max_iter=3000)
reg.fit(X_train, y_train)
reg.score(X_test, y_test)
```

```
0.650037878656426
```

### Evaluating Results

Unfortunately, this is worse than the R-squared of 0.72 we obtained with one-hot encodings.

There are a number of potential problems with our naive use of spaCy's embeddings.

- The text corpus the spaCy embeddings were trained on may be significantly different from the text present in the WORK_DESCRIPTION column. As a result, it may not encode some of the basic relationships between words in our data.
- When we process the work descriptions, we actually create word embeddings for the individual words. This gives a 300×W matrix (one column per word) representing the description, which is then averaged to get a single vector for the entire description. Information may be lost here.
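The second point is easy to see in isolation: averaging is order-invariant, so any two descriptions with the same bag of words map to the same vector. A toy sketch with made-up 2-dimensional "embeddings":

```
import numpy as np

# toy 2-d "embeddings" for two words (illustrative values only)
emb = {"replace": np.array([1.0, 0.0]), "porch": np.array([0.0, 1.0])}

# two different word orders produce the same averaged vector
a = np.mean([emb["replace"], emb["porch"]], axis=0)
b = np.mean([emb["porch"], emb["replace"]], axis=0)
print(np.allclose(a, b))  # True
```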

To improve on these, in the future, we might want to:

- Train word embeddings on our data before using their vector representations.
- Use more complex models, like neural networks, in order to not lose the time component of our work descriptions. State-of-the-art methods like the LSTM and Transformer better account for the complexity of sequence in text data.

## How Well Can We Predict Permits From the Work Description Alone?

Previously we focused on reported cost. Unfortunately, there is reason to believe the cost reported on permit applications is not a useful measure of truth because of the incentives at play for the applicant.

We now turn our attention to the actual permit fees and type of permit applied for.

We want to test:

- Can we predict the PERMIT_TYPE being applied for from the WORK_DESCRIPTION?
- Can we predict the TOTAL_FEES from the WORK_DESCRIPTION?

### Predicting Permit Type

We begin by trying to categorize the type of permit using one-hot encoded vectors to represent the work description. The pipeline for the data is similar to before except we use a support vector classifier instead of a regressor, since the task has changed.

```
text_clf = Pipeline([
    ('text', ColumnTransformer(
        transformers=[(
            'WORK_DESCRIPTION',
            Pipeline([
                ('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
            ]),
            'WORK_DESCRIPTION',
        )],
        remainder='passthrough',
    )),
    ('clf', svm.LinearSVC(max_iter=5000)),
])
text_clf.fit(X_train, y_train)
text_clf.score(X_test, y_test)
```

```
0.9640369221893704
```

We are able to predict the permit type with 96.4% accuracy on the testing data. This suggests the choice of words alone is predictive of the type of permit being applied for.
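A raw accuracy number is best read against a majority-class baseline. A sketch with scikit-learn's `DummyClassifier` (toy labels, not the real permit data):

```
import numpy as np
from sklearn.dummy import DummyClassifier

# toy class distribution: one dominant class at 70%
y = np.array(["ELECTRIC"] * 70 + ["SIGNS"] * 20 + ["ELEVATOR"] * 10)
X = np.zeros((len(y), 1))  # features are ignored by the dummy strategy

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.7: accuracy from always predicting ELECTRIC
```

If the real classifier only barely beats this kind of baseline, the headline accuracy is less impressive than it looks.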

### Problems

While the 96.4% is encouraging, it is also suspiciously high. Unless we know more about how the WORK_DESCRIPTION column is generated and how it might encode form-specific information from the different PERMIT_TYPE forms, we cannot be sure we'd be able to predict the requisite permits from a description alone. For example, the work description for all elevator permits might include a specific set of answers to a checkbox on the elevator permit form. Our model could then overfit on that boilerplate rather than on the presence of the word "elevator."

Further work should:

- Examine the coefficients determining classification to see if they are the result of natural language work descriptions versus form-to-dataset generation quirks.
- Attempt to evaluate the results with natural work descriptions sourced from internet forums.

## BONUS ROUND and Scratch Work

### Predicting Total Permit Fees

Given our model's high performance classifying permits from descriptions, it may be better at predicting permit fees than reported costs. In particular, permit fees may have a stronger relation to the description because they are generated by a review of an application that includes the work description.

I completed some preliminary work on this using a methodology similar to last week's (log-transformed TOTAL_FEE and one-hot encodings for WORK_DESCRIPTION) and ultimately attained a root mean squared error that corresponds to a factor-of-2 deviation from the truth, on average.

More research would need to be done to evaluate how useful this sort of prediction is compared to the available calculator online.
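The "factor of 2" reading comes directly from undoing the log10 transform: an RMSE of about 0.30 in log10 space means predictions are typically off by 10^0.30 ≈ 2×. A worked sketch with hypothetical numbers:

```
import numpy as np

# hypothetical fees and predictions, each off by exactly a factor of 2
y_true = np.log10(np.array([100.0, 250.0, 550.0]))
y_pred = np.log10(np.array([200.0, 125.0, 1100.0]))

rmse_log = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(10 ** rmse_log)  # 2.0: the average multiplicative error
```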

While I didn't have time to write up my analysis, some scratch work is included below. For each PERMIT_TYPE, I report the mean and standard deviation of TOTAL_FEE for that permit group, as well as the 10th, 50th, 90th, and 99th percentiles of TOTAL_FEE. I then randomly sample 5 permits from each group and report the predicted cost and the true final cost in original (non-log-transformed) terms.

```
'''
===NEW GROUP===
PERMIT - EASY PERMIT PROCESS
337.2890527846054
6109.386357665037
0.10 70.0
0.50 250.0
0.90 550.0
0.99 1375.0
Name: TOTAL_FEE, dtype: float64
[310.59043526 191.92247361 349.86647242 169.84179651 449.11951939]
log10_fees
584230 550.0
430185 200.0
632897 550.0
344317 100.0
7081 575.0
0 : REVISION TO PERMIT NO. 100763465; CHANGE OF HVAC TO SUN MECHANICAL SYSTEMS
1 : CHANGE HVAC CONTRACTOR TO K&F COOLING FOR PERMIT 100282299
2 : REPLACE THE EXISTING FIVE TON RTU WITH ONE CARRIER 5 TON RTU, SELF CONTAINED, AIR COOLED, R410, LOCATED ON THE ROOF. THE RTU SERVES THE SECOND FLOOR.
3 : REPLACE 68 WINDOWS. ALL WORK SAME AS EXISTING.
4 : REPLACE A FEW BRICKS / SPOT TUCKPOINT
===NEW GROUP===
PERMIT - ELECTRIC WIRING
99.12099523896315
465.1077306641869
0.10 40.0
0.50 75.0
0.90 150.0
0.99 675.0
Name: TOTAL_FEE, dtype: float64
[219.03726265 53.16560261 106.1888374 111.86819569 103.11903806]
log10_fees
46266 100.0
15119 50.0
620138 75.0
68417 215.0
579428 75.0
0 : temp 400 amp service
1 : INSTALL 1" OR 2" EMPTY CONDUIT FOR AT&T FIBER OPTICS.
2 : LOW VOLTAGE ALARM
3 : ELECTRICAL INSTALLATIONS AND REVISIONS OF THE CART RETURN CORRIFOR TO ONE STOP EXSPRESS.PASSAGEWAY AT TERMINAL 5. WORK IS ASSOCIATED WITH BUILDING PERMIT NUMBER 100485355
4 : INSTALLATION OF LOW VOLTAGE BURGLAR ALARM
===NEW GROUP===
PERMIT - ELEVATOR EQUIPMENT
713.5431690821256
1688.7435975594692
0.10 100.0
0.50 305.0
0.90 1430.0
0.99 7263.0
Name: TOTAL_FEE, dtype: float64
[ 278.69147449 99.88069882 405.02060757 704.88019834 1193.64505838]
log10_fees
23090 270.0
217481 100.0
38568 270.0
622595 425.0
440097 1275.0
0 : Installation Of One (10 Harmar CPL 600 VPL With Safety Landing Gate At Rear Deck Of Residence 750lb.cap, 10 FPM, With ADA Hands Free Emergency Phone. Pursuant To The Plans Sumbitted.
1 : Install One (1) 750lb.cap. Vertical Wheelchair Lift on Rear of Home ADA Phone Required. Pursuant to Plans Submitted.
2 : INSTALL ONE HYDRAULIC PASSENGER ELEVATOR AS PER PLANS
3 : FURNISH AND INSTALL ONE VERTICAL WHEELCHAIR LIFT
4 : INSTALL NEW DOOR RESTRICTORS,NEW LIGHT CIRCUI, SAFETY TESTS
===NEW GROUP===
PERMIT - FOR EXTENSION OF PMT
1582.9296551724137
1982.6095143265418
0.10 73.5000
0.50 823.3600
0.90 4587.5250
0.99 7051.2025
Name: TOTAL_FEE, dtype: float64
[ 174.77882012 494.55939623 164.93942334 1990.50896332 809.10091406]
log10_fees
23544 0.25
86389 6658.38
116014 122.50
89821 1109.11
145004 92.25
0 : REINSTATE TENT PERMIT FOR THE SHAMROCK SHUFFLE 2006 FOR THE DATE OF 4-2-06
1 : reinstate permit 1048964 for a 60 x 180 tent for oyster with the oyster from 10-19-06 to 10-21-06
2 : FIRST EXTENSION OF PERMIT # 100107982 AT 25% ORIGINAL PERMIT FEES. NO CHANGE TO DESCRIPTION OF WORK : ERECT NEW 2 STORY 4A SFR W/DETACHED GARAGE AS PER PLAN
3 : FIRST EXTENSION OF PERMIT # 100090817 ISSUED 02/07/06 AT 25% ORIGINAL PERMIT FEE. NO CHANGE TO DESCRIPTION OF WORK : INTERIOR ALTERATIONS TO UPGRADE ELECTRICAL MECHANICAL & PLUMBING AS PER PLANS
4 : FIRST EXTENSION OF PERMIT #100129086 AT 25% OF ORIGINAL COST NO CHANGE TO DESCRIPTION OF WORK : erect 2 story sfr with rear deck, roof deck, and detached 2 car frame garage as per plan.
===NEW GROUP===
PERMIT - NEW CONSTRUCTION
7244.593596940957
37581.57409388481
0.10 325.000
0.50 2150.000
0.90 9603.890
0.99 114347.364
Name: TOTAL_FEE, dtype: float64
[1149.9864995 1087.87652514 1476.99346442 2131.25936033 1344.82017411]
log10_fees
102692 4743.75
567652 7984.04
58907 1817.75
373001 1625.00
436937 3590.00
0 : PERMIT EXPIRES ON 09/08/2015 ERECTION STARTS: 7/7/2015, ERECTION ENDS: 7/9/2015. REINSTATEMENT OF PERMIT 100549306 TO ERECT 60' X 100' AND 20' X 60' TENTS FOR EVENT JULY 8, 2015
1 : TELECOMMUNICATION SITE WITH EQUIPMENT ON CONCRETE PAD WITH ANTENNAS ON NEW 100' MONOPOLE
2 : ERECT (1) 15' x 30', (1) 40' x 40', AND (1) 60' x 82' TENTS, FOR THE DUSABLE MUSEUM FOR EVENT ON APRIL 1, 2011. INSTALL 3/29/10 AND REMOVE 4/4/11.
3 : ERECT A 2 STORY, IVA FRAME SINGLE FAMILY RESIDENCE WITH BASEMENT, AND DETACHED 3 CAR GARAGE WITH ROOF DECK PER PLANS
4 : PPROPOSED 2 STORY FRAME SINGLE FAMILY RESIDENCE WITH BASEMENT AND DETACHED 2 CAR FRAME GARAGE AS PER PLANS.
===NEW GROUP===
PERMIT - PORCH CONSTRUCTION
228.79934775589282
135.9753682451586
0.10 135.0
0.50 185.0
0.90 350.0
0.99 765.2
Name: TOTAL_FEE, dtype: float64
[446.47015399 348.60692786 116.8850952 215.41215804 245.86964149]
log10_fees
27098 285.00
47074 210.00
25931 235.00
83239 176.25
32512 210.00
0 : REPAIR OF EXISTING (4) PORCHES LOCATED AT 5030-5036 SOUTH EVANS AVE. THE WORK INCLUDES ENHANCING OF FRAMING SYSTEM TO COMPLY W/CURRENT CITY OF CHICAGO STANDARDS. -CONDITIONAL PERMIT, SUBJECT TO FIELD VERIFICATION.
1 : REPLACE EXISTING 2 STORY REAR ENCLOSED PORCH WITH DOH PROTOTYPE.
2 : REBUILD BACK PORCH AS PER PLANS
3 : CONSTRUCTION OF A NEW SINGLE STORY WOOD DECK SAME AS EXISTING AS PER PLANS
4 : Replace rear open wood stair system in kind
===NEW GROUP===
PERMIT - REINSTATE REVOKED PMT
1046.3014823261117
3454.7208947492527
0.10 90.040
0.50 281.250
0.90 1939.890
0.99 12463.953
Name: TOTAL_FEE, dtype: float64
[282.02981197 117.21368541 112.54734808 175.64466113 142.10276903]
log10_fees
424339 755.50
222026 150.00
90253 112.50
171604 362.50
369445 142.19
0 : REINSTATEMENT OF PERMIT #100340466 TO ERECT SPECIAL EVENT TENT (2) 40' x 80' FROM 7/4/2011 TO 9/4/2011 - FOR 60 DAYS ONLY PERMIT EXPIRES ON 9/4/2011
1 : REINSTATEMENT OF PERMIT #100292530 TO ERECT SPECIAL EVENT TENT (1) 60' x 60' FROM 5/2/10 TO 07/2/2010 - FOR 60 DAYS ONLY PERMIT EXPIRES ON 07/2/2010
2 : REINSTATE PERMIT 100191228 AT 50%. OWNER AND CONTRACTORS TO REMAIN THE SAME.
3 : REINSTATEMENT OF PERMIT #100340467 TO ERECT SPECIAL EVENT TENT (1) 60' x 60' FROM 7/4/2011 TO 9/4/2011 - FOR 60 DAYS ONLY PERMIT EXPIRES ON 9/4/2011
4 : REINSTATEMENT OF PERMIT #100340463 TO ERECT SPECIAL EVENT TENT (10) 20' x 20' FROM 7/4/11 THROUGH 09/4/11 - FOR 60 DAYS ONLY PERMIT EXPIRES ON 09/4/2011
===NEW GROUP===
PERMIT - RENOVATION/ALTERATION
2398.085947746459
19770.884603309925
0.10 200.0000
0.50 800.0000
0.90 3825.0000
0.99 28229.8246
Name: TOTAL_FEE, dtype: float64
[ 806.56658129 624.01608924 1514.58805649 589.4067932 1085.86985815]
log10_fees
253257 1575.00
217008 375.00
568158 892.49
311639 14825.00
474200 2825.36
0 : PORCH REPAIRS
1 : ADDITION OF STAIR TO CONNECT UNITS 27A WITH 28A, BATHROOMS AND KITCHEN REMODEL.
2 : RECONFIGURE INTERIOR STAIRWELL, RECONFIGURE EXISTING BATHROOMS, AND RELOCATE DOOR FOR EXISTING BARBER SHOP AS PER PLANS. NO HVAC & PLUMBING WORK UNDER THIS PERMIT.
3 : SIDE STAIR REPLACEMENT TO CORRECT VIOLATION PER PLANS
4 : REMOVE AND REPLACE 6 EXISTING ANTENNAS WITH NEW TECHNOLOGY AND ADD 1 NEW EQUIPMENT CABINET TO AND EXITING WIRELESS COMMUNICATIONS FACILITY AS PER PLANS. SITE #CH60XC129
===NEW GROUP===
PERMIT - SCAFFOLDING
184.20639979602242
197.83836836490553
0.10 50.0
0.50 100.0
0.90 400.0
0.99 900.0
Name: TOTAL_FEE, dtype: float64
[86.43206368 74.20002061 55.85251088 55.40651441 35.53221446]
log10_fees
372367 100.0
277412 400.0
118612 50.0
176284 200.0
355462 600.0
0 : ERECT ONE WERNER SWING STAGE SCAFFOLD FROM 09/28/2010 - 10/28/2010
1 : ERECT 1 SCAFFOLD FROM 4/6/12 TO 4/6/13
2 : ERECT SIX SKYCLIMBER SCAFFOLDS FROM 10/01/2008 - 10/01-2009, NO OTHER REPAIRS ON THIS PERMIT.
3 : ERECT 2 SCAFFOLDS FROM 11/3/14 TO 11/3/15
4 : ERECT 1 SCAFFOLD FROM 9/16/14 TO 9/16/15
===NEW GROUP===
PERMIT - SIGNS
291.8125620140688
206.67145044804147
0.10 119.5
0.50 250.0
0.90 400.0
0.99 1200.0
Name: TOTAL_FEE, dtype: float64
[288.45938182 217.15004343 306.82055962 244.10472127 239.18584651]
log10_fees
567939 300.0
207707 250.0
482102 250.0
541817 250.0
232278 250.0
0 : NON-ILLUMINATED ALUMINUM BLADE SIGN WITH DOUBLE-SIDED VINYL GRAPHICS
1 : NON-ILLUMINATED ALUMINUM BLADE SIGN WITH DOUBLE-SIDED VINYL GRAPHICS
2 : INSTALLATION OF (1) NON ILLUMINATED AWING TO BE MOUNTED TO THE EAST BUILDING ELEVATION. AWNING TO READ ELIMINATING RACISM EMPOWERING WOMEN YWCA METROPOLITAN CHICAGO
3 : REMODEL THE EXISTING PYLON SIGN TO REMOVE THE TWO TENANTS NO LONGER IN THE SHOPPING CENTER AND INSTALLATION OF (1) NEW EMC TO BE MOUNTED IN THE EXISTING SIGN CABINET.
4 : INSTALLATION OF VINYL FOR SUITE B19
===NEW GROUP===
PERMIT - WRECKING/DEMOLITION
388.13682893617016
709.8742387699102
0.10 110.00
0.50 300.00
0.90 500.00
0.99 1692.92
Name: TOTAL_FEE, dtype: float64
[163.62839374 91.80030635 254.51155982 245.9460905 224.33865934]
log10_fees
90117 110.0
317169 300.0
407269 300.0
530475 500.0
292410 300.0
0 : WRECK & REMOVE 2 STORY BRICK RESIDENTIAL STRUCTURE
1 : WRECK & REMOVE 1 STORY FRAME STRUCTURE & TRAILER
2 : DEMOLITION AND REMOVAL OF A 2 STORY FRAME RES BUILDING.
3 : WRECK & REMOVE EXISTING CONCRETE PLANT. PLANT C. GARAGE.
4 : WRECK AND REMOVE A 1 STORY STEEL FRAMED MASONRY BUILDING.
'''
```

### John Lekberg

# In the City of Chicago, does the distribution of permit processing times change year over year?

In the City of Chicago, most construction work requires a building permit. In my previous research, I created a list of rules that described the distribution of processing times for a given year, with the goal of being able to predict the processing times for future years. But, I made an untested assumption that the distribution of processing times remained the same, year over year. The goal of this research is to test that assumption. I find that the distribution of processing times changes year over year for most of the permits. This means that the rule list previously created may be useless. Future research should quantify the change in distributions to assess the practical implications on the rule list.

## Introduction

In Chicago, most construction work requires a **building permit.**
If I need a permit, I want to know:

- How much will it cost? (Cost question.)
- How long will it take to be processed? (Time question.)

The City of Chicago answers the cost question by providing a calculator on their website. ("Calculate the Cost of a Building Permit", n.d.)

I couldn't find an answer to the time question, so I created a list of rules that describe the distribution of processing time based on permit type:

- For a Scaffolding permit, 90% are processed same-day. 99% are processed within 4 days.
- For a Reinstate Revoked Permit permit, 90% are processed same-day. 99% are processed within 21 days.
- For an Easy Permit Process permit, 90% of applications are processed within 1 day. 99% are processed within 27 days.
- For an Elevator Equipment permit, 90% are processed within 2 days. 99% are processed within 14 days.
- For an Electric Wiring permit, 90% are processed within 6 days. 99% are processed within 44 days.
- For a Renovation/Alteration permit, 90% are processed within 108 days. 99% are processed within 353 days.
- For a New Construction permit, 90% are processed within 111 days. 99% are processed within 469 days.
- For a Wrecking/Demolition permit, 90% are processed within 120 days. 99% are processed within 449 days.
- For a Signs permit, 90% are processed within 254 days. 99% are processed within 1,141 days.

(Based on 2019 data.)

However, when I built this rule list, I assumed that the distribution of processing times stayed the same year over year. If this assumption doesn't hold, then this rule list may not be useful, because describing the previous year's permit processing may not help you predict the next year's.

The goal of this research is to test my assumption that the distribution of processing times stays the same, year over year.

## Methods

The City of Chicago provides a dataset of building permits ("Building Permits", n.d.).
I downloaded the dataset as a comma-separated value (CSV) file and imported it
into a SQLite (Hipp et al. "SQLite") database, `permit.db`:

```
$ sqlite3 permit.db
> .mode csv
> .import Building_Permits.csv Permit_Raw
```

I identified 4 columns of interest in `Permit_Raw`:

- `permit#` — Tracking number assigned at beginning of permit application process.
- `permit_type` — Type of permit.
- `processing_time` — Number of days between `application_start_date` and `issue_date`.
- `issue_date` — Date when City determined permit ready to issue, subject to payment of permit fees.

(Column descriptions are from "Building Permits" [n.d.].)

Based on these columns, I created a new table, `Permit_Clean`, that is a cleaned-up version of `Permit_Raw`. Here's the schema:

```
CREATE TABLE Permit_Clean (
    id_pmt TEXT NOT NULL,
    cat_pmt_type TEXT NOT NULL,
    amt_pmt_proc_day INTEGER NOT NULL,
    dt_pmt_issue TEXT NOT NULL,
    PRIMARY KEY (id_pmt),
    CONSTRAINT 'cat_pmt_type non-empty'
        CHECK (cat_pmt_type <> ''),
    CONSTRAINT 'amt_pmt_proc_day non-negative'
        CHECK (amt_pmt_proc_day >= 0),
    CONSTRAINT 'dt_pmt_issue YYYY-MM-DD'
        CHECK (dt_pmt_issue LIKE '____-__-__')
);
```

To populate `Permit_Clean`, I took all data from `Permit_Raw` except for:

- Rows where `processing_time` was negative. (82 of 637,365 total rows.)

Here's the SQL code that does this:

```
INSERT INTO Permit_Clean
(id_pmt, cat_pmt_type, amt_pmt_proc_day, dt_pmt_issue)
SELECT "permit#"
, permit_type
, CAST(processing_time AS INTEGER)
, mdy_to_ymd(issue_date)
FROM Permit_Raw
WHERE CAST(processing_time AS INTEGER) >= 0
;
```

`mdy_to_ymd` is an application-defined function (Hipp et al. "Application-Defined SQL Functions") that I created to convert dates from `MM/DD/YYYY` format to `YYYY-MM-DD` format.
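In Python's `sqlite3` module, the same idea can be sketched with `Connection.create_function` (the author's actual implementation may differ):

```
import sqlite3

def mdy_to_ymd(text):
    # "09/22/2020" -> "2020-09-22"
    m, d, y = text.split("/")
    return f"{y}-{m}-{d}"

conn = sqlite3.connect(":memory:")
conn.create_function("mdy_to_ymd", 1, mdy_to_ymd)
row = conn.execute("SELECT mdy_to_ymd('09/22/2020')").fetchone()
print(row[0])  # 2020-09-22
```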

I want to test the hypothesis "the distribution of processing time is the same each year."

I test this hypothesis for the years 2017, 2018, and 2019 by using data from 2016 to 2019.

To test this hypothesis, I use the Kolmogorov-Smirnov test ("Kolmogorov-Smirnov test", n.d.) with a significance level α = 0.10 to compare data from one year to the next. The null hypothesis of this test is that both sets of data are drawn from the same continuous distribution.

I conduct the Kolmogorov-Smirnov test using the R programming language (R Core Team, 2018) with these software packages:

- stats (part of the standard library).
- DBI (R-SIG-DB, 2019).
- RSQLite (Müller, 2020).
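The author ran the two-sample test in R via `stats::ks.test`; the equivalent call in Python is `scipy.stats.ks_2samp`. As a sanity check of the null hypothesis, identical samples give D = 0 and p = 1:

```
import numpy as np
from scipy.stats import ks_2samp

# made-up processing times; "prior" is an identical copy of "year"
year = np.array([0, 0, 1, 2, 5, 10, 30], dtype=float)
prior = year.copy()

result = ks_2samp(year, prior)
print(result.statistic, result.pvalue)  # 0.0 1.0
```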

## Results

Here are the results of the Kolmogorov-Smirnov test for the years 2017, 2018, and 2019 (data from 2016 to 2019):

- Significance level is α = 0.10.
- Column **N_year** is the number of permits of that type issued in the year of interest.
- Column **N_prior** is the same count for the prior year.
- Column **D** is the test statistic of the Kolmogorov-Smirnov test.
- Column **p** is the p-value of the Kolmogorov-Smirnov test. The **null hypothesis** is that `amt_pmt_proc_day` has the same distribution for both the year of interest and the prior year.
- Permit types where **p** ≤ α are marked with an asterisk (*).

**2019 (vs. 2018):**

Permit | N_year | N_prior | D | p
---|---|---|---|---
Easy Permit Process* | 14,190 | 13,151 | 0.0430 | 2.16e-11
Electric Wiring* | 15,804 | 15,794 | 0.0166 | 2.57e-02
Elevator Equipment | 1,684 | 1,583 | 0.0310 | 4.15e-01
New Construction | 1,633 | 1,810 | 0.0381 | 1.65e-01
Reinstate Revoked Pmt | 324 | 185 | 0.0247 | 1.00e+00
Renovation/Alteration* | 9,481 | 9,741 | 0.0538 | 1.68e-12
Scaffolding | 441 | 419 | 0.0061 | 1.00e+00
Signs* | 4,035 | 3,175 | 0.1124 | 0.00e+00
Wrecking/Demolition* | 1,084 | 1,147 | 0.0525 | 9.25e-02

**2018 (vs. 2017):**

Permit | N_year | N_prior | D | p
---|---|---|---|---
Easy Permit Process | 13,151 | 12,549 | 0.0129 | 2.37e-01
Electric Wiring | 15,794 | 17,190 | 0.0125 | 1.55e-01
Elevator Equipment | 1,583 | 1,424 | 0.0414 | 1.52e-01
New Construction* | 1,810 | 1,856 | 0.0512 | 1.63e-02
Reinstate Revoked Pmt | 185 | 189 | 0.0221 | 1.00e+00
Renovation/Alteration* | 9,741 | 9,653 | 0.1236 | 0.00e+00
Scaffolding | 419 | 459 | 0.0136 | 1.00e+00
Signs* | 3,175 | 2,806 | 0.0580 | 8.87e-05
Wrecking/Demolition | 1,147 | 1,214 | 0.0348 | 4.72e-01

**2017 (vs. 2016):**

Permit | N_year | N_prior | D | p
---|---|---|---|---
Easy Permit Process* | 12,549 | 12,073 | 0.0261 | 4.65e-04
Electric Wiring* | 17,190 | 16,750 | 0.0165 | 1.95e-02
Elevator Equipment* | 1,424 | 1,187 | 0.0623 | 1.32e-02
New Construction* | 1,856 | 1,875 | 0.0435 | 5.82e-02
Reinstate Revoked Pmt | 189 | 220 | 0.0211 | 1.00e+00
Renovation/Alteration* | 9,653 | 9,414 | 0.0685 | 0.00e+00
Scaffolding | 459 | 461 | 0.0064 | 1.00e+00
Signs* | 2,806 | 2,707 | 0.0490 | 2.64e-03
Wrecking/Demolition* | 1,214 | 1,312 | 0.0643 | 1.08e-02

## Discussion

Based on the results of the Kolmogorov-Smirnov tests, I reject the null
hypothesis for most of the permit types in at least one of the years 2017,
2018, 2019.
Only permit types **Reinstate Revoked Permit** and **Scaffolding** were not
rejected in any year.

This means that the rule list that I developed in my previous work may be an inaccurate tool, because simply describing the previous year's distribution of processing times is insufficient to predict the next year's distribution.

The next step in this research will be to quantify the difference in the distributions. This is important because if the distributions are technically different but practically the same, then the rule list developed in my previous research may still have some use. For example:

- Suppose processing time in year `N` is distributed one day slower than in year `N`-1, so the distributions are technically different.
- The rule derived from year `N`-1 may say "For an **XYZ** permit, 90% are processed within 30 days. 99% are processed within 35 days."
- The correct rule for year `N` would be "For an **XYZ** permit, 90% are processed within 31 days. 99% are processed within 36 days."
- 30 days vs. 31 days (and 35 days vs. 36 days) isn't much of a difference. As a result, I think the slightly inaccurate rule set would still be useful for predicting how long processing a permit will take.

## References

- "Calculate the Cost of a Building Permit." n.d. City of Chicago. Accessed 2020-09-02. https://www.chicago.gov/city/en/depts/bldgs/provdrs/permits/svcs/permit_fee_calculator.html (WebArchive: https://web.archive.org/web/20200902183956/https://www.chicago.gov/city/en/depts/bldgs/provdrs/permits/svcs/permit_fee_calculator.html)
- "Building Permits." [Dataset.] Last modified 2020-09-02. City of Chicago. Accessed 2020-09-02. https://data.cityofchicago.org/Buildings/Building-Permits/ydr8-5enu
- Hipp, D. R., Kennedy, D., Mistachkin, J. "SQLite." [Software. Version 3.28.0.] SQLite Development Team. 2019. https://sqlite.org/src/info/884b4b7e502b4e99
- Hipp, D. R., Kennedy, D., Mistachkin, J. "Application-Defined SQL Functions." SQLite Development Team. Accessed 2020-09-02. https://www.sqlite.org/appfunc.html (WebArchive: https://web.archive.org/web/20200907031255/https://www.sqlite.org/appfunc.html)
- R Core Team. "R: A language and environment for statistical computing." [Software. Version 3.5.2.] 2018. R Foundation for Statistical Computing. https://www.R-project.org/
- R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, Kirill Müller. "DBI: R Database Interface." [Software. Version 1.1.0.] 2019. R Consortium. https://CRAN.R-project.org/package=DBI
- Kirill Müller and Hadley Wickham and David A. James and Seth Falcon. "RSQLite: 'SQLite' Interface for R." [Software. Version 2.2.0.] 2020. https://CRAN.R-project.org/package=RSQLite
- "Kolmogorov-Smirnov test." Encyclopedia of Mathematics. Accessed 2020-09-21. http://encyclopediaofmath.org/index.php?title=Kolmogorov-Smirnov_test&oldid=47513 (WebArchive: http://web.archive.org/web/20200922013836/https://encyclopediaofmath.org/index.php?title=Kolmogorov-Smirnov_test)

### Daniel Bassett

I built further on my neural network, asking: with the given data, can we estimate functions for processing time and likelihood of a violation?

This is still an early version of the network. Future versions will include more parameters and optimized layers.

```
import pandas as pd
import numpy as np
import torch
from matplotlib import pyplot as plt

permits = pd.read_csv("Building_Permits.csv", thousands=",")
"""
converters = {"SUBTOTAL_PAID": float,
    "SUBTOTAL_UNPAID": float,
    "REPORTED_COST": float,
    "TOTAL_FEE": float,
    "PROCESSING_TIME": float}
"""
```

```
' converters = {"SUBTOTAL_PAID":float,\n "SUBTOTAL_UNPAID":float,\n "REPORTED_COST":float,\n "TOTAL_FEE":float,\n "PROCESSING_TIME":float}'
```

```
print(permits.columns.values)
print(permits["SUBTOTAL_PAID"].dtype)
clean = permits.dropna(axis=0)
```

```
['ID' 'PERMIT#' 'PERMIT_TYPE' 'REVIEW_TYPE' 'APPLICATION_START_DATE'
'ISSUE_DATE' 'PROCESSING_TIME' 'STREET_NUMBER' 'STREET DIRECTION'
'STREET_NAME' 'SUFFIX' 'WORK_DESCRIPTION' 'BUILDING_FEE_PAID'
'ZONING_FEE_PAID' 'OTHER_FEE_PAID' 'SUBTOTAL_PAID' 'BUILDING_FEE_UNPAID'
'ZONING_FEE_UNPAID' 'OTHER_FEE_UNPAID' 'SUBTOTAL_UNPAID'
'BUILDING_FEE_WAIVED' 'ZONING_FEE_WAIVED' 'OTHER_FEE_WAIVED'
'SUBTOTAL_WAIVED' 'TOTAL_FEE' 'CONTACT_1_TYPE' 'CONTACT_1_NAME'
'CONTACT_1_CITY' 'CONTACT_1_STATE' 'REPORTED_COST' 'COMMUNITY_AREA'
'WARD' 'XCOORDINATE' 'YCOORDINATE' 'VIOLATION_ID']
float64
```

```
subtotal_paid = clean["SUBTOTAL_PAID"]
total_fee = clean["TOTAL_FEE"]
processing_time = clean["PROCESSING_TIME"]
violation_id = clean["VIOLATION_ID"]
reported_cost = clean["REPORTED_COST"]
print((subtotal_paid - total_fee).mean())
```

```
-114.41881703176709
```

```
subtotal_paid.dtype
violation_id.dtype
fee_ratio = subtotal_paid / total_fee
print(fee_ratio)
```

```
5 1.0
10 1.0
12 1.0
13 1.0
16 1.0
...
636722 0.0
636723 1.0
636724 1.0
636725 1.0
636726 0.0
Length: 521637, dtype: float64
```

```
permits_numpy = np.array(
    [
        fee_ratio,
        subtotal_paid,
        total_fee,
        reported_cost,
        processing_time,
        violation_id,
    ],
    dtype=np.float32,
)
permits_numpy.shape, permits_numpy.dtype, permits_numpy
```

```
((6, 521637),
dtype('float32'),
array([[1.000000e+00, 1.000000e+00, 1.000000e+00, ..., 1.000000e+00,
1.000000e+00, 0.000000e+00],
[5.000000e+01, 5.000000e+01, 7.500000e+01, ..., 2.250000e+02,
5.500000e+02, 0.000000e+00],
[5.000000e+01, 5.000000e+01, 7.500000e+01, ..., 2.250000e+02,
5.500000e+02, 1.500000e+02],
[1.000000e+00, 7.500000e+02, 1.000000e+03, ..., 6.472000e+03,
1.753200e+04, 7.550000e+03],
[1.600000e+01, 0.000000e+00, 0.000000e+00, ..., 0.000000e+00,
0.000000e+00, 0.000000e+00],
[2.604962e+06, 5.925308e+06, 2.621932e+06, ..., 3.162157e+06,
3.162158e+06, 3.162159e+06]], dtype=float32))
```

```
permits_torch = torch.from_numpy(permits_numpy)
permits_torch.shape, permits_torch.dtype
```

```
(torch.Size([6, 521637]), torch.float32)
```

```
fee_ratio_actual = np.where(
    np.logical_and(fee_ratio >= 0, fee_ratio <= 1)
)
processing_time_actual = np.where(
    np.less_equal(processing_time, 5000)
)
fee_ratio.mean(), fee_ratio.var()
```

```
(0.9685662035380701, 0.027418734640841984)
```

```
# Now that we have the data organized and our variables isolated,
# let's visualize the data
fig = plt.figure(dpi=600)
plt.xlabel("Paid Fee Ratio")
plt.ylabel("Processing Time")
plt.plot(permits_numpy[0], permits_numpy[4], "bo")
```

```
print(max(processing_time))
```

```
4930.0
```

```
print(
    "It's not clear from this visualization. Let's throw in more variables and build our Pytorch neural network."
)
import torch.optim as optim
import torch.nn as nn
from sklearn import preprocessing
```

```
It's not clear from this visualization. Let's throw in more variables and build our Pytorch neural network.
```

```
frt = permits_torch[2]  # fee ratio tensor
rct = permits_torch[3]  # reported cost tensor
ptt = permits_torch[4]  # processing time tensor
frt = torch.tensor(frt).unsqueeze(1)
ptt = torch.tensor(ptt).unsqueeze(1)
ptt.shape
```

```
torch.Size([521637, 1])
```

```
n_samples = ptt.shape[0]
n_val = int(0.2 * n_samples)
shuffled_indices = torch.randperm(n_samples)
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]
train_indices, val_indices
```

```
(tensor([390992, 455982, 478208, ..., 228113, 40310, 300034]),
tensor([135076, 107839, 12386, ..., 344239, 229882, 73529]))
```

```
ptt_train = ptt[train_indices]
frt_train = frt[train_indices]
ptt_value = ptt[val_indices]
frt_value = frt[val_indices]
ptt_untrain = 0.1 * ptt_train
ptt_unval = 0.1 * ptt_value
linear_model = nn.Linear(1, 1)
linear_model(ptt_unval)
```

```
tensor([[ 1.4346],
[-0.3379],
[ 1.2130],
...,
[-0.3379],
[-0.3379],
[-0.1902]], grad_fn=<AddmmBackward>)
```

```
linear_model.weight
```

```
Parameter containing:
tensor([[0.7386]], requires_grad=True)
```

```
linear_model.bias
```

```
Parameter containing:
tensor([-0.3379], requires_grad=True)
```

```
x = torch.ones(1)
linear_model(x)
```

```
tensor([0.4006], grad_fn=<AddBackward0>)
```

```
x = torch.ones(10, 1)
linear_model(x)
```

```
tensor([[0.4006],
[0.4006],
[0.4006],
[0.4006],
[0.4006],
[0.4006],
[0.4006],
[0.4006],
[0.4006],
[0.4006]], grad_fn=<AddmmBackward>)
```

```
linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)
linear_model.parameters()
```

```
<generator object Module.parameters at 0x0000022712B7DBA0>
```

```
list(linear_model.parameters())
```

```
[Parameter containing:
tensor([[-0.7984]], requires_grad=True),
Parameter containing:
tensor([-0.0445], requires_grad=True)]
```

```
def training_loop(n_epochs, optimizer, model, loss_fn,
                  ptt_train, ptt_value, frt_train, frt_value):
    for epoch in range(1, n_epochs + 1):
        pt_train = model(ptt_train)
        loss_train = loss_fn(pt_train, frt_train)

        pt_val = model(ptt_value)
        loss_val = loss_fn(pt_val, frt_value)

        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()

        if epoch == 1 or epoch % 1000 == 0:
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}")

def loss_fn(pt, frt):
    squared_diffs = (pt - frt)**2
    return squared_diffs.mean()

linear_model = nn.Linear(1, 1)
optimizer = optim.Adam(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs=3000,
    optimizer=optimizer,
    model=linear_model,
    loss_fn=loss_fn,
    ptt_train=ptt_untrain,
    ptt_value=ptt_unval,
    frt_train=frt_train,
    frt_value=frt_value,
)

print()
print(linear_model.weight)
print(linear_model.bias)
```

```
Epoch 1, Training loss 78878864.0000, Validation loss 59914564.0000
Epoch 1000, Training loss 78812888.0000, Validation loss 59848468.0000
Epoch 2000, Training loss 78761344.0000, Validation loss 59796900.0000
Epoch 3000, Training loss 78721384.0000, Validation loss 59756920.0000
Parameter containing:
tensor([[25.6018]], requires_grad=True)
Parameter containing:
tensor([28.6782], requires_grad=True)
```
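The 0.1 scaling of the inputs (`ptt_untrain`, `ptt_unval`) is a crude form of normalization; a more standard alternative is to standardize the inputs to zero mean and unit variance, which generally makes gradient-based training better behaved. A minimal sketch on a made-up stand-in for the fees tensor:

```python
import torch

# Hypothetical stand-in for the fees tensor.
fees = torch.tensor([100.0, 250.0, 75.0, 500.0, 320.0])

mean, std = fees.mean(), fees.std()
fees_norm = (fees - mean) / std

# The standardized tensor has mean ~0 and standard deviation ~1.
print(fees_norm.mean().item(), fees_norm.std().item())
```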

```
linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs=3000,
    optimizer=optimizer,
    model=linear_model,
    loss_fn=nn.MSELoss(),  # built-in loss in place of the handwritten loss_fn
    ptt_train=ptt_untrain,
    ptt_value=ptt_unval,
    frt_train=frt_train,
    frt_value=frt_value,
)

print()
print(linear_model.weight)
print(linear_model.bias)
```

```
Epoch 1, Training loss 78870432.0000, Validation loss 59906128.0000
Epoch 1000, Training loss 78098944.0000, Validation loss 59167068.0000
Epoch 2000, Training loss 78098944.0000, Validation loss 59167068.0000
Epoch 3000, Training loss 78098944.0000, Validation loss 59167068.0000
Parameter containing:
tensor([[23.6578]], requires_grad=True)
Parameter containing:
tensor([821.0441], requires_grad=True)
```
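`nn.MSELoss` computes the same quantity as the handwritten `loss_fn` above, the mean of squared differences, so swapping it in changes nothing numerically. A quick check on made-up tensors:

```python
import torch
import torch.nn as nn

pred = torch.tensor([[1.0], [2.0], [3.0]])
target = torch.tensor([[1.5], [2.0], [2.0]])

# Handwritten mean-squared-error vs. the built-in module.
handwritten = ((pred - target) ** 2).mean()
builtin = nn.MSELoss()(pred, target)

print(torch.allclose(handwritten, builtin))  # True
```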

```
seq_model = nn.Sequential(
    nn.Linear(1, 13),  # hidden layer with 13 units
    nn.Tanh(),
    nn.Linear(13, 1),  # output layer
)
seq_model
```

```
Sequential(
(0): Linear(in_features=1, out_features=13, bias=True)
(1): Tanh()
(2): Linear(in_features=13, out_features=1, bias=True)
)
```

```
[param.shape for param in seq_model.parameters()]
```

```
[torch.Size([13, 1]), torch.Size([13]), torch.Size([1, 13]), torch.Size([1])]
```
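Those four tensors account for every parameter in the model: the hidden `Linear(1, 13)` contributes 13 weights plus 13 biases, and the output `Linear(13, 1)` contributes 13 weights plus 1 bias, for 40 in total. `numel()` confirms the count:

```python
import torch.nn as nn

seq_model = nn.Sequential(nn.Linear(1, 13), nn.Tanh(), nn.Linear(13, 1))

# Total parameter count: 13*1 + 13 + 1*13 + 1 = 40.
total = sum(p.numel() for p in seq_model.parameters())
print(total)  # 40
```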

```
for name, param in seq_model.named_parameters():
    print(name, param.shape)
```

```
0.weight torch.Size([13, 1])
0.bias torch.Size([13])
2.weight torch.Size([1, 13])
2.bias torch.Size([1])
```

```
from collections import OrderedDict

seq_model = nn.Sequential(
    OrderedDict(
        [
            ("hidden_linear", nn.Linear(1, 8)),
            ("hidden_activation", nn.Tanh()),
            ("output_linear", nn.Linear(8, 1)),
        ]
    )
)
seq_model
```

```
Sequential(
(hidden_linear): Linear(in_features=1, out_features=8, bias=True)
(hidden_activation): Tanh()
(output_linear): Linear(in_features=8, out_features=1, bias=True)
)
```

```
for name, param in seq_model.named_parameters():
    print(name, param.shape)
```

```
hidden_linear.weight torch.Size([8, 1])
hidden_linear.bias torch.Size([8])
output_linear.weight torch.Size([1, 8])
output_linear.bias torch.Size([1])
```

```
seq_model.output_linear.bias
```

```
Parameter containing:
tensor([-0.2399], requires_grad=True)
```

```
optimizer = optim.SGD(seq_model.parameters(), lr=1e-3)  # smaller lr for the deeper model

training_loop(
    n_epochs=5000,
    optimizer=optimizer,
    model=seq_model,
    loss_fn=nn.MSELoss(),
    ptt_train=ptt_untrain,
    ptt_value=ptt_unval,
    frt_train=frt_train,
    frt_value=frt_value,
)

print("output", seq_model(ptt_unval))
print("answer", frt_value)
print("hidden", seq_model.hidden_linear.weight.grad)
```

```
Epoch 1, Training loss 77971072.0000, Validation loss 59061600.0000
Epoch 1000, Training loss 77970968.0000, Validation loss 59060816.0000
Epoch 2000, Training loss 77970632.0000, Validation loss 59060944.0000
Epoch 3000, Training loss 77970728.0000, Validation loss 59060132.0000
Epoch 4000, Training loss 77975160.0000, Validation loss 59055040.0000
Epoch 5000, Training loss 77970688.0000, Validation loss 59059828.0000
output tensor([[1004.1236],
[ 654.4979],
[1004.1236],
...,
[ 654.4979],
[ 654.4979],
[ 654.5068]], grad_fn=<AddmmBackward>)
answer tensor([[175.],
[ 50.],
[240.],
...,
[150.],
[300.],
[575.]])
hidden tensor([[ 1.4657e+02],
[-5.6897e-02],
[ 6.7035e-02],
[ 4.1606e-01],
[-4.1232e-02],
[ 9.1959e-03],
[-4.4917e-01],
[ 4.8263e-01]])
```

```
from matplotlib import pyplot as plt

t_range = torch.arange(20.0, 90.0).unsqueeze(1)

fig = plt.figure(dpi=600)
plt.xlabel("Total Fees")
plt.ylabel("Processing Time")
plt.plot(ptt.numpy(), frt.numpy(), "o")
plt.plot(
    t_range.numpy(),
    seq_model(0.1 * t_range).detach().numpy(),
    "c-",
)
plt.plot(
    ptt.numpy(),
    seq_model(0.1 * ptt).detach().numpy(),
    "kx",
)
```

```
from sklearn.metrics import r2_score
```

The Jupyter notebook stalled before computing the R² score. In summary, this small neural net did not find a compelling relationship between fees and processing time. This is probably due to the small number of parameters relative to the large dataset. Future iterations of the network will include more parameters and optimized built-ins to obtain more accurate predictions for processing time and violation probability.
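For reference, R² can also be computed directly with tensor operations, which avoids handing large arrays to sklearn all at once. A minimal sketch, where `y_true` reuses the printed "answer" values above and `y_pred` is made up for illustration:

```python
import torch
from sklearn.metrics import r2_score

# y_true taken from the printed answer tensor; y_pred is hypothetical.
y_true = torch.tensor([175.0, 50.0, 240.0, 150.0, 300.0, 575.0])
y_pred = torch.tensor([200.0, 80.0, 220.0, 160.0, 280.0, 500.0])

# R^2 = 1 - SS_res / SS_tot, computed with plain tensor ops.
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# Matches sklearn's implementation up to float precision.
print(abs(r2.item() - r2_score(y_true.numpy(), y_pred.numpy())) < 1e-5)  # True
```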