Tìm p-value (mean) trong scikit-learn LinearRegression

28

scikit-học của hồi quy tuyến tính không tính toán thông tin này nhưng bạn có thể dễ dàng mở rộng các lớp để làm điều đó:

from sklearn import linear_model 
from scipy import stats 
import numpy as np 


class LinearRegression(linear_model.LinearRegression): 
    """ 
    LinearRegression class after sklearn's, but calculate t-statistics 
    and p-values for model coefficients (betas). 
    Additional attributes available after .fit() 
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1]) 
    which is (n_features, n_coefs) 
    This class sets the intercept to 0 by default, since usually we include it 
    in X. 
    """ 

    def __init__(self, *args, **kwargs): 
     if not "fit_intercept" in kwargs: 
      kwargs['fit_intercept'] = False 
     super(LinearRegression, self)\ 
       .__init__(*args, **kwargs) 

    def fit(self, X, y, n_jobs=1): 
     self = super(LinearRegression, self).fit(X, y, n_jobs) 

     sse = np.sum((self.predict(X) - y) ** 2, axis=0)/float(X.shape[0] - X.shape[1]) 
     se = np.array([ 
      np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X)))) 
                for i in range(sse.shape[0]) 
        ]) 

     self.t = self.coef_/se 
     self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1])) 
     return self

bị đánh cắp từ here.

Bạn nên xem statsmodels cho loại phân tích thống kê này bằng Python.

Nguồn

2015-01-13 17:54:43 elyase

+0

Bất kỳ lý do không đóng góp trở lại này? Tuy nhiên dường như không được bao gồm và nó có vẻ như rất nhiều peeps đang quan tâm đến nó – Sammaron

4

Đoạn mã trên không thực sự làm việc. Lưu ý rằng sse là một vô hướng, và sau đó nó cố gắng lặp qua nó. Mã sau đây là một phiên bản sửa đổi. Không đáng kinh ngạc, nhưng tôi nghĩ nó hoạt động nhiều hay ít.

class LinearRegression(linear_model.LinearRegression): 

    def __init__(self,*args,**kwargs): 
     # *args is the list of arguments that might go into the LinearRegression object 
     # that we don't know about and don't want to have to deal with. Similarly, **kwargs 
     # is a dictionary of key words and values that might also need to go into the orginal 
     # LinearRegression object. We put *args and **kwargs so that we don't have to look 
     # these up and write them down explicitly here. Nice and easy. 

     if not "fit_intercept" in kwargs: 
      kwargs['fit_intercept'] = False 

     super(LinearRegression,self).__init__(*args,**kwargs) 

    # Adding in t-statistics for the coefficients. 
    def fit(self,x,y): 
     # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column 
     # of constants. 

     # Not totally sure what 'super' does here and why you redefine self... 
     self = super(LinearRegression, self).fit(x,y) 
     n, k = x.shape 
     yHat = np.matrix(self.predict(x)).T 

     # Change X and Y into numpy matricies. x also has a column of ones added to it. 
     x = np.hstack((np.ones((n,1)),np.matrix(x))) 
     y = np.matrix(y).T 

     # Degrees of freedom. 
     df = float(n-k-1) 

     # Sample variance.  
     sse = np.sum(np.square(yHat - y),axis=0) 
     self.sampleVariance = sse/df 

     # Sample variance for x. 
     self.sampleVarianceX = x.T*x 

     # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root. ugly) 
     self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I) 

     # Standard erros for the difference coefficients: the diagonal elements of the covariance matrix. 
     self.se = self.covarianceMatrix.diagonal()[1:] 

     # T statistic for each beta. 
     self.betasTStat = np.zeros(len(self.se)) 
     for i in xrange(len(self.se)): 
      self.betasTStat[i] = self.coef_[0,i]/self.se[i] 

     # P-value for each beta. This is a two sided t-test, since the betas can be 
     # positive or negative. 
     self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

Nguồn

2015-01-16 00:46:25 Alex

6

Bạn có thể sử dụng sklearn.feature_selection.f_regression.

click here for the scikit-learn page

Nguồn

2016-01-24 23:44:19

+0

Vì vậy, đó là những chiếc F-kiểm tra? Tôi nghĩ rằng các giá trị p cho hồi quy tuyến tính thường là cho mỗi biến hồi quy cá nhân, và nó là một thử nghiệm so với null của hệ số là 0? Giải thích thêm về chức năng sẽ là cần thiết cho một câu trả lời hay. – wordsforthewise

29

này được loại overkill nhưng chúng ta hãy cho nó một đi. Đầu tiên cho phép sử dụng statsmodel để tìm hiểu những gì các p-giá trị nên

import pandas as pd 
import numpy as np 
from sklearn import datasets, linear_model 
from sklearn.linear_model import LinearRegression 
import statsmodels.api as sm 
from scipy import stats 

diabetes = datasets.load_diabetes() 
X = diabetes.data 
y = diabetes.target 

X2 = sm.add_constant(X) 
est = sm.OLS(y, X2) 
est2 = est.fit() 
print(est2.summary())

và chúng tôi nhận

      OLS Regression Results        
============================================================================== 
Dep. Variable:      y R-squared:      0.518 
Model:       OLS Adj. R-squared:     0.507 
Method:     Least Squares F-statistic:      46.27 
Date:    Wed, 08 Mar 2017 Prob (F-statistic):   3.83e-62 
Time:      10:08:24 Log-Likelihood:    -2386.0 
No. Observations:     442 AIC:        4794. 
Df Residuals:      431 BIC:        4839. 
Df Model:       10           
Covariance Type:   nonrobust           
============================================================================== 
       coef std err   t  P>|t|  [0.025  0.975] 
------------------------------------------------------------------------------ 
const  152.1335  2.576  59.061  0.000  147.071  157.196 
x1   -10.0122  59.749  -0.168  0.867 -127.448  107.424 
x2   -239.8191  61.222  -3.917  0.000 -360.151 -119.488 
x3   519.8398  66.534  7.813  0.000  389.069  650.610 
x4   324.3904  65.422  4.958  0.000  195.805  452.976 
x5   -792.1842 416.684  -1.901  0.058 -1611.169  26.801 
x6   476.7458 339.035  1.406  0.160 -189.621 1143.113 
x7   101.0446 212.533  0.475  0.635 -316.685  518.774 
x8   177.0642 161.476  1.097  0.273 -140.313  494.442 
x9   751.2793 171.902  4.370  0.000  413.409 1089.150 
x10   67.6254  65.984  1.025  0.306  -62.065  197.316 
============================================================================== 
Omnibus:      1.506 Durbin-Watson:     2.029 
Prob(Omnibus):     0.471 Jarque-Bera (JB):    1.404 
Skew:       0.017 Prob(JB):      0.496 
Kurtosis:      2.726 Cond. No.       227. 
==============================================================================

Ok, chúng ta hãy tái sản xuất này. Đó là loại overkill như chúng tôi gần như tái tạo một phân tích hồi quy tuyến tính bằng cách sử dụng Đại số Matrix. Nhưng cái quái gì vậy.

lm = LinearRegression() 
lm.fit(X,y) 
params = np.append(lm.intercept_,lm.coef_) 
predictions = lm.predict(X) 

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X)) 
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns)) 

# Note if you don't want to use a DataFrame replace the two lines above with 
# newX = np.append(np.ones((len(X),1)), X, axis=1) 
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0])) 

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal()) 
sd_b = np.sqrt(var_b) 
ts_b = params/ sd_b 

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b] 

sd_b = np.round(sd_b,3) 
ts_b = np.round(ts_b,3) 
p_values = np.round(p_values,3) 
params = np.round(params,4) 

myDF3 = pd.DataFrame() 
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilites"] = [params,sd_b,ts_b,p_values] 
print(myDF3)

Và điều này cho chúng ta.

Coefficients Standard Errors t values Probabilites 
0  152.1335   2.576 59.061   0.000 
1  -10.0122   59.749 -0.168   0.867 
2  -239.8191   61.222 -3.917   0.000 
3  519.8398   66.534  7.813   0.000 
4  324.3904   65.422  4.958   0.000 
5  -792.1842   416.684 -1.901   0.058 
6  476.7458   339.035  1.406   0.160 
7  101.0446   212.533  0.475   0.635 
8  177.0642   161.476  1.097   0.273 
9  751.2793   171.902  4.370   0.000 
10  67.6254   65.984  1.025   0.306

Vì vậy, chúng tôi có thể tạo lại các giá trị từ mô hình thống kê.

Nguồn

2017-03-08 17:17:57 JARH

+1

nó có nghĩa là var_b của tôi là tất cả Nans? Có bất kỳ lý do cơ bản nào tại sao phần đại số tuyến tính không thành công? – famargar

+0

Thực sự khó đoán là tại sao có thể. Tôi sẽ xem xét cấu trúc dữ liệu của bạn và so sánh nó với ví dụ. Điều đó có thể cung cấp một đầu mối. – JARH

+0

Có vẻ như 'code' np.linalg.inv đôi khi có thể trả về một kết quả ngay cả khi ma trận là phi invertable. Đó có thể là vấn đề. – JARH

3

Bạn có thể sử dụng scipy cho giá trị p. Mã này là từ tài liệu scipy.

>>> from scipy import stats 
>>> import numpy as np 
>>> x = np.random.random(10) 
>>> y = np.random.random(10) 
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

Nguồn

2017-10-24 13:57:03

Tìm p-value (mean) trong scikit-learn LinearRegression

Trả lời

Các vấn đề liên quan