fastai v4: Lesson 1
Notes from Lesson 1 of fastai v4
Notes
- Discussed what’s needed to take this course (not much!).
- Discussed history of AI and how deep neural networks came to be. Mostly news to me, 1/10.
- Introduced Jupyter, ipywidgets, REPL. Mostly seen this before, 7/10.
- Introduced ML (repeat, but still new, 4/10)
- take an input, process it, get an output.
- Samuel’s terminology: take inputs & weights into a model, generate results
- Add feedback: measure performance of the results (a metric), then change weights. Rinse & repeat.
- Different weights allow the same model to perform a different task.
- Universal approximation theorem: a neural network can, in theory, approximate any function to any desired level of accuracy.
- Need a way to automatically update the weights: SGD, stochastic gradient descent (see the sketch below).
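A minimal sketch of one SGD step in PyTorch (the toy model, data, and learning rate are made up for illustration, not from the lesson):

```python
import torch

# Toy "model": the prediction is a dot product of an input and the weights.
w = torch.randn(3, requires_grad=True)   # weights (parameters)
x = torch.tensor([1.0, 2.0, 3.0])        # an input
target = torch.tensor(10.0)              # the label

lr = 0.01  # learning rate
for _ in range(100):
    pred = (w * x).sum()                 # run the model
    loss = (pred - target) ** 2          # measure performance (loss)
    loss.backward()                      # compute the gradient of loss w.r.t. w
    with torch.no_grad():
        w -= lr * w.grad                 # step the weights against the gradient
        w.grad.zero_()                   # reset the gradient for the next step
```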
ML Limitations
- A model can only be created from data
- A model can only learn from patterns in the inputs
- A model can only make a prediction - actions happen externally
Labeled data is key and often missing - e.g. labels like "good part", "bad part", etc.
It is important to note that a feedback loop can be created, resulting in a causal relationship where none existed before. Jeremy gave an example in lesson 1. This goes to ethics in ML and understanding inherent biases in your training dataset that you may not be aware of.
The fastai notebook
Intro to fastai. Discussed the REPL, datasets, etc.

```python
from fastai.vision.all import *
learner = cnn_learner(dls, resnet34, metrics=error_rate)
```

Here `cnn_learner` is a function that generates our model, `dls` is our dataset, `resnet34` is our architecture, and `error_rate` is a metric used for feedback. `resnet34` is a predefined neural network, pretrained on images, that is free to use; the 34 indicates it has 34 layers, and more layers require more GPU memory. `error_rate` is computed on data not used in training, also known as a holdout set or validation set, to help detect overfitting. You might need to increase the size of the holdout set to avoid overfitting. Keeping this data unseen is important in model building; otherwise the model will only know how to recognize images it has seen before.
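For context, here is a minimal end-to-end sketch in the style of the book's chapter 1 cat classifier (the dataset, labeling rule, and calls follow that example; treat the exact values as illustrative):

```python
from fastai.vision.all import *

# Download the Oxford-IIIT Pets dataset used in the book's first example.
path = untar_data(URLs.PETS)/'images'

# In this dataset, cat images have filenames starting with an uppercase letter.
def is_cat(x): return x[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,  # hold out 20% for validation
    label_func=is_cat, item_tfms=Resize(224))             # resize inputs to 224x224

learn = cnn_learner(dls, resnet34, metrics=error_rate)    # pretrained resnet34
learn.fine_tune(1)  # transfer learning: fit the new head, then fine-tune the rest
```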
Other uses
segmentation: figuring out what every pixel in an image is (what label it corresponds to)
For a PWA (printed wiring assembly), labels might be
- trace, screw, via, component
A larger training set might allow
- trace, via, capacitor, resistor, inductor, transistor, integrated circuit, etc.
Project idea: create an architecture for use in image segmentation that is trained on images of various PWAs (Raspberry Pis, motherboards, etc.) to recognize features.
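As a hedged sketch of how segmentation training looks in fastai, here is the book's CamVid example (street scenes, not PWAs; the dataset and label-path convention come from that example):

```python
from fastai.vision.all import *
import numpy as np

path = untar_data(URLs.CAMVID_TINY)  # small street-scene segmentation dataset
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames=get_image_files(path/'images'),
    # Each image's pixel-label mask lives in labels/ with a _P suffix.
    label_func=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes=np.loadtxt(path/'codes.txt', dtype=str))  # class name for each pixel value

learn = unet_learner(dls, resnet34)  # a U-Net built on a pretrained resnet34
learn.fine_tune(8)
```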
tabular data: fitting a model to predict salary based on a variety of parameters, or predicting the ratings a user might give a movie they haven't seen based on previous ratings they have given (known as collaborative filtering, used in recommendation engines)
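A sketch of the collaborative filtering case, following the book's MovieLens sample (the calls mirror that example; note `y_range`, which appears again in the questionnaire below):

```python
from fastai.collab import *

path = untar_data(URLs.ML_SAMPLE)  # small MovieLens ratings sample
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
# y_range constrains predictions to the valid rating range (0.5 to 5 stars).
learn = collab_learner(dls, y_range=(0.5, 5.5))
learn.fine_tune(10)
```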
Jargon
Keyword | Description |
---|---|
Architecture | The “program” we are running. Often synonymous with model, but represents its functional form. |
Parameters | The “weights” into the “program” that alter its performance |
Predictions | The output of our architecture, computed from the independent variables (which do not include labels). |
Labels | The targets or dependent variables, assumed to be true for a given prediction. |
Loss | A measure of performance used to drive training: how well did our prediction (computed from independent variables) match our labels (our dependent variables)? |
Model | The combination of the parameters and architecture that can act on inputs to generate a prediction. |
Inputs | The data on which the model acts to generate a prediction. No inputs (data) == No predictions! |
Action | The decision made from reasoning about a given prediction. |
Transfer learning | Using a pre-trained model for a task other than what it was originally intended for (via training) |
Fine Tuning | A transfer learning technique that updates the parameters of a pretrained model by training for additional epochs on a different task from that used for pretraining. |
head | The newly added final layer(s) of a model, specific to the new dataset. When cnn_learner is used, it replaces the pretrained model's head with one suited to your data. |
epoch | One complete training of the model on the complete dataset |
Overfitting | When a model learns the unique characteristics of a dataset rather than a generalization, limiting its ability to do inference on new data. |
Howard, Jeremy. Deep Learning for Coders with fastai and PyTorch . O’Reilly Media. Kindle Edition.
From the loss, we can *update* the weights, thus allowing our system to learn.
Homework
Do the questionnaire, run the notebooks, etc.
Questionnaire
Some of these questions were answered after watching both videos and reading chapter 1 of the book. Although this blog post is titled lesson 1, chapter 1 is covered in the first two lessons. Since I’m consuming the material one lesson at a time I’ll continue to follow this pattern.
- (If you’re not sure of the answer to a question, try watching the next lesson and then coming back to this one, or read the first chapter of the book.)
Do you need these for deep learning?
Lots of math T / F
- False: Much of the math is handled for you thanks to libraries such as numpy.
Lots of data T / F
- False: Relatively small datasets can train world-class models - no need for “Big Data”!
Lots of expensive computers T / F
- False: You can use consumer-grade hardware if you have it, or take advantage of Google Cloud, AWS, Colab, etc.
A PhD T / F
- False: You will need some knowledge of programming and an understanding of your dataset, but a PhD is not needed.
Name five areas where deep learning is now the best in the world.
- NLP (natural language processing)
- Computer Vision
- Medicine
- Biology
- Image generation
- Forecasting
- Robotics (grip strength, movement)
What was the name of the first device that was based on the principle of the artificial neuron?
The Mark I Perceptron (sounds like something from Fallout!)
Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)?
- A set of processing units
- A state of activation
- An output function for each unit
- A pattern of connectivity between units
- A propagation rule for patterns to pass between units via a pattern of connectivity
- An activation rule that controls the output for a given input and current state of a unit
- A learning rule that modifies patterns of connectivity
- An environment to operate in (often overlooked)
What were the two theoretical misunderstandings that held back the field of neural networks?
- That because a single layer of neurons could not represent the XOR function, neural networks were a dead end
- That deeper neural networks (which fix this limitation) would be too big and too slow to be useful
What is a GPU?
- A Graphics Processing Unit, designed for massively parallel floating point operations on vectors and matrices.
Open a notebook and execute a cell containing: 1+1. What happens?
- It computes and displays the output (2)
Follow through each cell of the stripped version of the notebook for this chapter. Before executing each cell, guess what will happen.
- Done in the notebook…
Complete the Jupyter Notebook online appendix.
- Done
Why is it hard to use a traditional computer program to recognize images in a photo?
- It requires a pixel-by-pixel search of the image, or something akin to a kernel to review pixel groupings, with significant comparison logic and cyclomatic complexity. Here we tell the computer what to think and when to think it - we are explicit. This is the converse of machine learning, which learns the patterns in the data and uses those patterns to recognize images in a photo.
What did Samuel mean by “weight assignment”?
- Weight assignments are variables used in a model to alter the performance of the program (model). The weight assignments must have an automatic means to be updated, so as to provide feedback to improve the performance of the model for its respective purpose.
What term do we normally use in deep learning for what Samuel called “weights”?
- We call them “parameters”.
Draw a picture that summarizes Samuel’s view of a machine learning model.
- results = model(inputs, weights)
- performance = measure(results, actuals)
- weights = update(weights, performance)
Essentially, the inputs and weights of a model are used to generate results. Those results have a measured performance, which is improved by modifying the weights and repeating the operation.
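As a runnable toy version of that loop (the measure and update functions here are illustrative stand-ins, not Samuel's or fastai's):

```python
# Toy version of Samuel's loop for a one-weight model: y = weight * x.
def model(inputs, weight):
    return [weight * x for x in inputs]

def measure(results, actuals):
    # Squared-error performance measure.
    return sum((r - a) ** 2 for r, a in zip(results, actuals))

def update(weight, inputs, actuals, lr=0.01):
    # Gradient of the squared error with respect to the weight.
    grad = sum(2 * (weight * x - a) * x for x, a in zip(inputs, actuals))
    return weight - lr * grad

inputs, actuals = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # the true weight is 2
weight = 0.0
for _ in range(50):
    results = model(inputs, weight)
    performance = measure(results, actuals)
    weight = update(weight, inputs, actuals)
print(weight, performance)  # weight approaches 2.0 as performance falls toward 0
```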
Why is it hard to understand why a deep learning model makes a particular prediction?
- Because of the complexity of the “black box”: it would be necessary to instrument each layer of a neural network in order to understand why. In other words, the “why” is spread across the entire state of the network, and some parts of that state may be more interpretable than others.
What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy?
- Universal approximation theorem (reminds me of using Taylor series expansions to approximate various mathematical functions)
What do you need in order to train a model?
- You need a dataset (with labels) and an architecture (possibly a pretrained one), plus a loss to measure performance and a way to update the parameters
How could a feedback loop impact the rollout of a predictive policing model?
- Inherent bias in the dataset could lead to increased policing of a particular population; the resulting arrest data then feeds back into the dataset, reinforcing the existing bias.
Do we always have to use 224×224-pixel images with the cat recognition model?
- No, but larger images require more GPU memory and take longer to train
What is the difference between classification and regression?
- Classification predicts the category of a given input (think predicting finite enumerations)
- Regression predicts a continuous numeric value, like tomorrow's temperature
What is a validation set? What is a test set? Why do we need them?
- A validation or hold out set is a subset of a dataset that we do not train the model with, but we use to grade the model’s performance.
- A test set is a dataset we do not show even ourselves, to avoid introducing bias through EDA and model training.
- We need these to avoid overfitting or introducing bias into our models
What will fastai do if you don’t provide a validation set?
- It will automatically hold out a subset of your data as the validation set (20% by default, via valid_pct=0.2).
Can we always use a random sample for a validation set? Why or why not?
- Not always. If we drew a new random validation sample each time we trained (resampling and retraining repeatedly), the model could eventually see all of the data and overfit.
- If we use a fixed holdout for validation, repeated training cannot overfit to it, because the model never learns from the validation set.
- Random sampling can also be wrong for the data itself: with time series we should hold out a specific period (say the last two weeks) rather than a random sample of the dataset, since the goal is to predict the future.
What is overfitting? Provide an example.
- Overfitting is when a model “memorizes” a given dataset rather than generalizing from it. It can happen when we train for too long or with too little data: training error keeps falling while validation error starts to rise. For example, if you overtrained a model on your dataset of cats, it would perform poorly on new cat photos it has never seen before.
What is a metric? How does it differ from “loss”?
- A metric is for human consumption: it measures how well our model performs on the validation set (e.g. error_rate), and it is what we ultimately care about. Loss is the measure the training system uses to update the weights via SGD, so it must change smoothly as the weights change; a metric only needs to be meaningful to us.
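A small illustration of the distinction (the tensors are made up, and the error rate is computed by hand rather than with fastai's helper): cross-entropy loss changes smoothly with the raw outputs, while the error rate only jumps when a prediction flips.

```python
import torch
import torch.nn.functional as F

preds = torch.tensor([[2.0, 1.0], [0.5, 3.0]])  # raw model outputs (logits)
targets = torch.tensor([0, 1])                   # true classes

loss = F.cross_entropy(preds, targets)           # smooth, differentiable: drives SGD
error_rate = (preds.argmax(dim=1) != targets).float().mean()  # step-like: for humans
print(loss.item(), error_rate.item())            # prints ~0.2 and 0.0
```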
How can pretrained models help?
- Pretrained models have already learned generally useful features, so we can adapt them to a new task (transfer learning) with less data and compute
What is the “head” of a model?
- The head of the model is the layer whose parameters are modified by training
What kinds of features do the early layers of a CNN find? How about the later layers?
- This depends, but early layers of a CNN might find simple features like edges, basic shapes, or color gradients. Later layers might recognize repeating patterns or more specific shapes.
Are image models only useful for photos?
- No! It’s possible to create images from many different types of data, including log files, executables, sound, etc. This can be done by transforming data to grayscale, or using existing visualization techniques (such as FFT).
What is an “architecture”?
- An architecture is the functional form, or template, of a model - the structure to which we then fit parameters. `resnet` is an example of an architecture used in this book for computer vision.
What is segmentation?
- Segmentation is labeling the individual pixels of an image, e.g. applying the label “car” to the parts of an image with a car in it.
What is y_range used for? When do we need it?
- `y_range` tells fastai the range of valid values for a continuous target, so predictions are constrained to that range (e.g. y_range=(0.5, 5.5) for movie ratings in the collaborative filtering sketch above). We need it when predicting a continuous number (regression) rather than a category.
What are “hyperparameters”?
- A hyperparameter is a parameter about parameters: a higher-level choice such as the network architecture, learning rate, or data augmentation strategy. When choosing hyperparameters we must be careful or we may inadvertently introduce bias or overfit to the validation data, basically “finding the answer we want to hear”.
What’s the best way to avoid failures when using AI in an organization?
- To understand why a validation and test data set are needed and to ensure they are properly used when developing a model.
Further Research
I skipped these because I thought the answers were easy, but decided to return to them for completeness.
Why is a GPU useful for deep learning? How is a CPU different, and why is it less effective for deep learning?
- GPUs contain thousands of massively parallel execution cores optimized for floating point arithmetic and matrix math. They contain relatively small amounts of extremely high bandwidth memory. These properties make them advantageous for DL, but inhibit them from being used for, say, running a web browser. CPUs are general purpose computing devices, containing substantially fewer cores and slower memory, and they tend to favor integer math over floating point or matrix math (although Intel has tried to change this with AVX512…poorly). A key performance indicator for both CPUs and GPUs that tends to predict DL effectiveness is FLOPS (floating point operations per second); GPUs typically deliver substantially more FLOPS than CPUs.
Try to think of three areas where feedback loops might impact the use of machine learning. See if you can find documented examples of that happening in practice.
Elections: if a model with a feedback loop is used to predict election outcomes and has substantial influence on a population of voters, it could disincentivize voters from turning out for an election. While this might be most visible for American presidential races, it could affect less publicized elections such as those for ballot measures and Congress. I don’t think FiveThirtyEight is this…yet.
With the recent Facebook news and Russian influence in American democracy, I think we’re beginning to see a trend of adversarial machine learning: exploiting a competitor’s production models to destabilize or influence their outcomes. I think nation states and other actors will continue to use feedback loops in active learning systems.
Search results. I Googled a whiskey bottle to read a review before purchasing it. Now Google shows me whiskey reviews in my news feed, displacing other news articles. Reading these seems to reinforce the model and increases the number of articles I see. Since the number of articles I can read is limited, there is no way to “tone down” the model without telling it I’m simply not interested in whiskey. It’s kind of interesting because in general you implicitly train Google. You can’t implicitly untrain it (e.g. there is no decay function or “forgetting”, or it operates on a time horizon I haven’t seen yet).