1. What is Machine Learning?
1.1. Definition
- Arthur Samuel (1959): Field of study that gives computers the ability to learn without being explicitly programmed. <<A relatively high-level definition>>
- Tom Mitchell (1998), Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. <<A more practical definition; the difficulty lies in identifying the T/P/E of a given problem>>
1.2. Main Goal (Aim)
- Learning functions: given features x and labels y, ML learns the relationship between x and y. ML is the process of learning a function or hypothesis h(x) that best approximates y.
- Generalisation: the hypothesis can then make predictions on new data points x, such that h(x) is as close as possible to the real label y. For humans, it is exactly this ability to generalise that lets us use the known against the unknown, and the finite against the infinite.
1.3. Learning functions
- (a): Original data points;
- (b): Fit by three piecewise-linear segments (perfect fit);
- parameters: 3 x 2 = 6 variables;
- (c): Fit by third-order polynomial (perfect fit);
- parameters: 3 (polynomial) + 1 (bias) = 4 variables;
- (d): Fit by first-order polynomial (straight line - not perfect);
- parameters: 2 variables;
- Rule of thumb: the number of data points should be roughly 10 x the number of parameters (variables)
- Tips:
- Avoid overfitting
- Control the number of parameters
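The parameter-counting rule above can be sketched numerically. The data below is made up, and `numpy.polyfit` stands in for the fitting procedure; with 20 points, the rule of thumb only justifies about 2 parameters, so the higher-degree fits are overfitting the noise:

```python
import numpy as np

# Assumed noisy, roughly linear data: 20 points.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.05, 20)

# More parameters (higher degree) never increases the training error,
# but beyond ~2 parameters the extra capacity just fits the noise.
errors = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)              # degree + 1 parameters
    errors[degree] = np.sum((np.polyval(coeffs, x) - y) ** 2)
```

The training error shrinks monotonically as parameters are added, which is precisely why training error alone cannot tell you when to stop.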
2. Representation
2.1. Supervised/Unsupervised Learning
- Machine Learning is learning from experience. It is also called supervised learning. E consists of features and labels, and P and T are well-defined.
- Pattern Recognition is finding patterns without experience. It is also called unsupervised learning. E consists of only features, and P and T are defined in much broader terms of finding 'interesting patterns'.
Supervised learning:
- The labels are known: what they are, what they look like, and what the correct answer/result is
- There is a fairly clear relationship between input and output
- Example: house area and house price
Unsupervised learning:
- The definition of the labels is unclear, or can be understood very broadly
- Data points are grouped (clustering) according to some relationship already present in the dataset
- Examples:
- Clustering genome data
- Separating music from dialogue in an audio recording
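Clustering can be sketched with a minimal k-means loop. The 2-D feature points below are made up, and no labels are used anywhere; the algorithm groups the points purely from structure in the features:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two assumed groups of 2-D feature points -- features only, no labels.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(3.0, 0.3, (20, 2))])

k = 2
centroids = X[[0, 20]].copy()   # deterministic init: one point per group
for _ in range(10):
    # Assign every point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it.
    centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
```

The loop alternates between assigning points and re-estimating group centres; after convergence the two centroids sit near the two underlying groups.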
2.2. Representation view on ML (Pedro Domingos, 2012)
Machine Learning = Representation + Evaluation + Optimisation
- Representation: a way of describing the problem and data
- Evaluation: similar to the measure P proposed by Tom Mitchell
- Optimisation: an algorithm that drives the predicted values towards the true values
2.3. Classification
Classification is an ML task where T has a discrete set of outcomes.
- Often classification is binary: {0, 1}
- Examples:
- face detection
- smile detection
- spam classification
- hot/cold
2.4. Regression
Regression is an ML task where T has a real-valued outcome on some continuous sub-space:
- Examples:
- age estimation
- stock value prediction
- temperature prediction
- energy consumption prediction
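The difference between the two tasks can be sketched with a single assumed feature; the temperature values and the 20 °C threshold below are made up:

```python
# Assumed daily temperatures in °C.
temps = [3.0, 18.5, 27.2, 31.0, 9.4]

# Regression task: the outcome is real-valued (e.g. predict tomorrow's
# temperature), so the labels live on a continuous sub-space.
regression_labels = temps

# Classification task: the outcome is discrete, here binary {0, 1}
# ("hot" vs "cold"), using an assumed 20 °C threshold.
classification_labels = [1 if t >= 20.0 else 0 for t in temps]
```

The same features can serve either task; what changes is whether the label set is discrete or continuous.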
3. Features, labels, tasks
3.1. Features and Labels
- Data points or instances make up the data used to learn a hypothesis h or find a pattern g
- In Machine Learning, a data point consists of feature/label tuples {x, y}
- A single data point comes from one measurement/observation
- Many data points together make a dataset
3.2. Labels
Labels y are the values that h(x) aims to predict.
- Obtaining labels is usually an arduous task
- Often manual
- Repetitive
- Complicated experiments
- Difficult to obtain data
- Example:
- Facial expressions of pain
- Impact of diet on astronauts in space
- Predictions of house prices
3.3. Features/Attributes
Features/Attributes are measurable values of variables for which some form of pattern exists and that can be used to infer the associated label y.
- Sender domain in spam detection
- Mouth corner location in smile detection
- Temperature in forest fire prediction
- Pixel value in face detection
- Head pose estimation from facial point locations
3.4. Features Definition
- For a given problem, all data points must have the same, fixed-length set of features x, a row vector with d elements: x = [x_1, x_2, ..., x_d]
- A dataset with n data points is then denoted as an n x d matrix X, whose i-th row is the feature vector of the i-th data point
3.5. Labels Definition
For a given problem with a singular task, the set of labels y accompanying the set of features X is given as a column vector with one label per data point: y = [y_1, y_2, ..., y_n]^T
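A minimal sketch of these definitions with assumed numbers (house area and room count as features, price as the label):

```python
import numpy as np

# n = 3 data points, each a fixed-length row vector of d = 2 features.
X = np.array([[120.0, 3.0],    # area (m^2), number of rooms -- assumed
              [ 85.0, 2.0],
              [200.0, 5.0]])
# One label per data point (assumed prices): a length-n vector.
y = np.array([300.0, 210.0, 520.0])

n, d = X.shape
```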
4. Linear Regression Intro
4.1. Simplest Example - Latitude and Temperature
4.2. Training Set and Meaning of Symbols
4.3. Learning Flow
Univariate Linear Regression: One feature.
4.4. Training Algorithm - Minimises the Cost Function
Given a model h with a solution space of possible parameters, training searches this space for the parameters that minimise a cost function, e.g. the mean squared error between h(x) and the true labels y over the training set.
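This training loop can be sketched for univariate linear regression, assuming an MSE cost minimised by gradient descent; the latitude/temperature numbers below are made up for illustration:

```python
import numpy as np

# Assumed training set: latitude (degrees) -> average temperature (°C).
x = np.array([10.0, 25.0, 40.0, 55.0])
y = np.array([28.0, 20.0, 12.0, 4.0])

# Hypothesis h(x) = theta0 + theta1 * x; cost J = mean((h(x) - y)^2) / 2.
theta0, theta1 = 0.0, 0.0
alpha = 0.001                      # learning rate (a hyper-parameter)
for _ in range(100000):
    err = theta0 + theta1 * x - y  # h(x) - y on every data point
    theta0 -= alpha * err.mean()          # gradient of J w.r.t. theta0
    theta1 -= alpha * (err * x).mean()    # gradient of J w.r.t. theta1
```

With this data the loop converges to a slope of about -0.53 °C per degree of latitude and an intercept of about 33.3 °C, the least-squares line through the four points.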
4.5. Intrinsic/Hyper Parameters
Intrinsic parameters
- Can be efficiently learned on the training set
- Large in number
- E.g. weights in linear regression or Artificial Neural Network
- Intrinsic parameters are variables that the model can learn automatically from the data; e.g. the weights and biases in deep learning
Hyper-parameters
- Must be learned by establishing generalisation error
- No efficient search possible
- Smaller in number
- E.g. the number of nodes in an ANN or the degree of a polynomial linear regression model
- Hyper-parameters determine the model itself: different hyper-parameters give different models. They are usually set based on experience; e.g. the learning rate, the number of iterations, the number of layers, and the number of neurons per layer
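The distinction can be sketched with made-up data: the polynomial coefficients (intrinsic parameters) are fitted on the training split, while the degree (a hyper-parameter) is chosen by comparing error on a held-out validation split, i.e. by estimating generalisation error:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 40)   # assumed: true relation is linear

x_train, y_train = x[:30], y[:30]    # used to learn intrinsic parameters
x_val, y_val = x[30:], y[30:]        # used to estimate generalisation error

best_degree, best_err = None, float("inf")
for degree in (1, 3, 9):             # candidate hyper-parameter values
    coeffs = np.polyfit(x_train, y_train, degree)   # intrinsic parameters
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if val_err < best_err:
        best_degree, best_err = degree, val_err
```

Note the asymmetry: the coefficients are found efficiently by least squares, whereas the degree must be tried value by value, which is why hyper-parameters are kept few in number.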
4.6. Brute Force Search
Note - questions to think about:
- You can’t do this for hyper-parameters using the above formulation (why not?)
- Clearly you can’t search all possible values (why not?)
- This is a very small formula, but there are some hidden caveats. Can you write matlab/pseudo code for this?
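One possible answer to the last question, in Python rather than MATLAB: evaluate the cost at every candidate parameter value on a finite grid and keep the best. The model, data, and grid ranges are assumed for illustration; the comments point at the hidden caveats the note mentions:

```python
import numpy as np

# Assumed data, roughly y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def cost(theta0, theta1):
    """Mean squared error of the hypothesis h(x) = theta0 + theta1 * x."""
    return np.mean((theta0 + theta1 * x - y) ** 2)

best, best_cost = None, float("inf")
# Hidden caveats: the grid must actually contain (a point near) the optimum,
# the step size bounds the achievable accuracy, and the number of cost
# evaluations grows exponentially with the number of parameters searched.
for t0 in np.arange(-1.0, 1.0, 0.1):
    for t1 in np.arange(0.0, 4.0, 0.1):
        c = cost(t0, t1)
        if c < best_cost:
            best, best_cost = (t0, t1), c
```

This also hints at why the same recipe fails for hyper-parameters: each cost evaluation there requires training a whole model, and continuous hyper-parameters have no natural finite grid.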