Imputing Missing Values with Linear Regression
Published: 2019-03-04


  • Missing data

    Missing data can roughly be classified into three types:

    1. MCAR (Missing Completely At Random) means that there is nothing systematic about why some data is missing. That is, there is no relationship between the fact that data is missing and either the observed or unobserved covariates.
    2. MAR (Missing At Random) resembles MCAR because there is still an element of randomness, but the probability that a value is missing may depend on the observed data. Given the observed data, it does not depend on the missing values themselves.
    3. MNAR (Missing Not At Random) implies that the fact that data is missing is directly correlated with the value of the missing data.
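To make the three mechanisms concrete, here is a minimal simulation sketch. The `age` and `income` variables, the 3000 threshold, and the missingness probabilities are all made-up assumptions for illustration; nothing here comes from a real data set.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical data: income rises with age, plus noise.
age = [rng.uniform(20, 60) for _ in range(n)]
income = [1000 + 50 * a + rng.gauss(0, 300) for a in age]

# Each mask marks which income values go missing (True = missing).
# MCAR: every value has the same 30% chance of being missing.
mcar = [rng.random() < 0.3 for _ in range(n)]
# MAR: missingness depends only on age, which is always observed.
mar = [rng.random() < (0.6 if a > 40 else 0.1) for a in age]
# MNAR: missingness depends on the income value itself.
mnar = [rng.random() < (0.6 if y > 3000 else 0.1) for y in income]


def observed_mean(values, missing):
    """Mean of the values that are still observed."""
    kept = [v for v, m in zip(values, missing) if not m]
    return sum(kept) / len(kept)


true_mean = sum(income) / n
print("true mean:", round(true_mean))
for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "observed mean:", round(observed_mean(income, mask)))
```

In this setup the observed mean under MCAR should stay close to the true mean, while under MAR and MNAR it is pulled down because higher incomes go missing more often. The practical difference is that under MAR the bias can be corrected using the observed `age` values, whereas under MNAR the observed data alone is not enough.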
  • How to deal with missing data

    1. Delete the entries with missing values
    2. Replace missing values with the mean or median
    3. Linear Regression

      First, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation.

      The variable with missing data is used as the dependent variable.

      Second, cases with complete data for the predictor variables are used to generate the regression equation.

      Third, the equation is then used to predict missing values for incomplete cases in an iterative process.

      The above describes single-variable (simple) linear regression.
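The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the `rows` records and the column names `x1`, `x2`, `y` are hypothetical, a single best predictor is used, and the sketch does one pass rather than the iterative process mentioned above.

```python
from math import sqrt

# Hypothetical records; None marks a missing value of the target "y".
rows = [
    {"x1": 1.0, "x2": 5.0, "y": 3.0},
    {"x1": 2.0, "x2": 3.0, "y": 5.0},
    {"x1": 3.0, "x2": 6.0, "y": 7.0},
    {"x1": 4.0, "x2": 2.0, "y": 9.0},
    {"x1": 5.0, "x2": 4.0, "y": None},
    {"x1": 6.0, "x2": 1.0, "y": None},
]


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Only cases with a known y can be used to build the equation.
complete = [r for r in rows if r["y"] is not None]
ys = [r["y"] for r in complete]

# Step 1: pick the predictor most strongly correlated with y.
best = max(["x1", "x2"],
           key=lambda c: abs(pearson([r[c] for r in complete], ys)))

# Step 2: fit y = slope * x + intercept on the complete cases.
xs = [r[best] for r in complete]
mx, my = mean(xs), mean(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Step 3: predict y for the incomplete cases; complete cases are kept as-is.
imputed = [dict(r, y=slope * r[best] + intercept) if r["y"] is None else r
           for r in rows]

for r in imputed:
    print(r)
```

The original `rows` list is left untouched, so the imputed values can be compared against other strategies such as mean or median replacement.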

    4. Multiple linear regression

      Simple linear regression has significant limits:

      • It cannot easily fit a data set that is non-linear
      • It can only be used to make predictions that fall within the range of the training data set
      • It can only be fit to data sets with a single dependent variable and a single independent variable

      This is where multiple regression comes in. It is specifically designed to create regressions on models with a single dependent variable and multiple independent variables.

      The equation for multiple regression takes the form:

      $y = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + a$

      $b_i$: coefficients;

      $x_i$: independent variables, also called predictor variables;

      $y$: dependent variable, also called the criterion variable;

      $a$: a constant stating the value of the dependent variable when all independent variables are zero (the intercept).
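As a small worked example of the equation, the sketch below evaluates it for one observation. The function name `predict` and the numbers are assumptions chosen only for illustration.

```python
def predict(coefficients, intercept, features):
    """Return b_1*x_1 + ... + b_n*x_n + a for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per independent variable")
    return sum(b * x for b, x in zip(coefficients, features)) + intercept


b = [2.0, -1.0, 0.5]   # hypothetical coefficients b_1..b_3
a = 3.0                # hypothetical intercept
x = [1.0, 2.0, 4.0]    # one observation x_1..x_3

y = predict(b, a, x)
print(y)  # 2*1 - 1*2 + 0.5*4 + 3 = 5.0
```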

      How to fit a multiple regression model?

      Just as we minimize the sum of squared errors to find the coefficient in simple linear regression, we minimize the sum of squared errors to find all of the $b_i$ coefficients in multiple regression.

      Specifically, we use stochastic gradient descent (SGD).
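A minimal sketch of fitting by stochastic gradient descent might look like this. The data is synthetic and noise-free, generated from an assumed relationship y = 2*x1 - 3*x2 + 1, and the learning rate and epoch count are arbitrary choices that happen to work for inputs in the range [-1, 1]; real data would usually need feature scaling and tuning.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical noise-free data from y = 2*x1 - 3*x2 + 1.
true_b, true_a = [2.0, -3.0], 1.0
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
Y = [sum(b * x for b, x in zip(true_b, row)) + true_a for row in X]

b = [0.0, 0.0]        # coefficients b_1, b_2, starting at zero
a = 0.0               # intercept
learning_rate = 0.05

for epoch in range(50):
    order = list(range(len(X)))
    rng.shuffle(order)  # visit the samples in a new random order each epoch
    for i in order:
        prediction = sum(bj * xj for bj, xj in zip(b, X[i])) + a
        error = prediction - Y[i]
        # Gradient of 0.5 * error**2 with respect to each parameter,
        # computed from this single sample only.
        b = [bj - learning_rate * error * xj for bj, xj in zip(b, X[i])]
        a -= learning_rate * error

print([round(bj, 3) for bj in b], round(a, 3))
```

Because each update uses one sample rather than the whole data set, the steps are cheap but noisy. With this noise-free data the estimates should settle very close to the assumed values 2, -3, and 1.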

      How to make sure the model fits the data well?

      Use the same $r^2$ value that was used for linear regression.

      $r^2$, which is called the coefficient of determination, states the proportion of variation in the dependent variable that is predicted by the model. It is a value ranging from 0 to 1, with 0 stating that the model has no ability to predict the result and 1 stating that the model predicts the result perfectly.
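One common way to compute this value is 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The sketch below assumes that definition; the function name `r_squared` and the sample numbers are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
print(r_squared(actual, actual))                # perfect predictions
print(r_squared(actual, [6.0] * 4))             # always predicting the mean
print(r_squared(actual, [2.8, 5.1, 7.2, 8.9]))  # close, but not perfect
```

The first call gives 1.0, the second gives 0.0, and the third gives roughly 0.995. Note that this sketch does not handle the case where every actual value is identical, since SS_tot would then be zero.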


Reposted from: http://pjge.baihongyu.com/
