Learn how to model curvy relationships in your data with this powerful technique.
We've seen how Simple and Multiple Linear Regression try to fit a straight line (or a flat plane) through our data points. But what happens when the relationship between our input (X) and output (Y) isn't a straight line? What if it curves?
Think about things like: the relationship between experience level and salary (often curves upwards faster later), the path of a thrown ball, or how temperature affects crop yield. These often show non-linear patterns.
This is where Polynomial Regression comes to the rescue! It's a type of regression analysis that allows us to model these curved relationships by using polynomial terms (like x², x³, etc.) in our equation.
Main Technical Concept: Polynomial Regression models the relationship between the independent variable (X) and dependent variable (Y) as an nth degree polynomial. It's used when data shows a curved pattern where linear regression fails.
Remember the simple linear equation: y = b₀ + b₁x. This can only draw straight lines.
Polynomial Regression extends this by adding higher powers of the independent variable x:
y = b₀ + b₁x + b₂x² + b₃x³ + ... + bₙxⁿ
Where:
- y is the predicted value (the dependent variable)
- x is the independent variable
- b₀, b₁, ..., bₙ are the coefficients the model learns
- n is the degree of the polynomial
By adding terms like x² (which creates a parabola), x³ (which creates an S-shape), etc., the model can create much more flexible curves that fit non-linear data better.
Think of it like this: a degree 1 polynomial is a straight ruler. A degree 2 polynomial is like bending the ruler once. A degree 3 polynomial is like bending it twice, and so on. More bends allow it to fit more complex shapes.
Interestingly, even though the *relationship* between X and Y is curved (non-linear), it's still considered a type of linear model in a statistical sense. Why? Because the equation is linear *with respect to the coefficients* (b₀, b₁, b₂, ...). We are still just finding the best weights for each term, even if those terms involve powers of X. This means we can still use the same `LinearRegression` techniques from libraries like Scikit-learn to solve for the coefficients!
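To make this concrete, here is a minimal sketch (on a small, made-up set of points) showing that fitting a curve is just ordinary linear regression on hand-built columns for x and x²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny, made-up dataset that roughly follows a parabola
x = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([2.1, 4.9, 10.2, 17.1, 26.0])

# Manually build the polynomial "features": one column for x, one for x^2
X_manual = np.hstack([x, x ** 2])

# An ordinary linear model just finds the best weights for those columns
model = LinearRegression()
model.fit(X_manual, y)

print("b0 (intercept):", model.intercept_)
print("b1, b2 (weights for x and x^2):", model.coef_)
```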
The key trick is to transform our original independent variable(s) into polynomial features before fitting a standard linear regression model.
1. Prepare your independent variable(s) `X` and dependent variable `y`.
2. Create an instance of the `PolynomialFeatures` class, specifying the degree.
3. Call its `fit_transform` method on your original `X` data. This will generate a new array `X_poly` containing the original column(s), plus columns for x², x³, etc., up to the specified degree. Note that `PolynomialFeatures` automatically adds a column of ones for the intercept term.
4. Create an instance of `LinearRegression`.
5. Fit it on the polynomial features (`X_poly`) and the original target variable (`y`): `model.fit(X_poly, y)`.
.import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# 1. Load dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values # Level (Independent) - keep as 2D array
y = dataset.iloc[:, -1].values # Salary (Dependent)
# (Optional) Fit Linear Regression for comparison
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# 2. Create Polynomial Features (e.g., degree 4)
poly_reg = PolynomialFeatures(degree=4) # Choose the degree
X_poly = poly_reg.fit_transform(X) # Transform X into X, X^2, X^3, X^4
# Print X_poly to see the transformation (includes column of 1s for intercept)
print("Original X:\n", X[:3])
print("\nPolynomial Features (X_poly, degree 4):\n", X_poly[:3])
# 3. Train Linear Regression on Polynomial Features
poly_lin_reg = LinearRegression()
poly_lin_reg.fit(X_poly, y) # Fit model using the transformed features
# 4. Visualize Results
# --- Visualize Linear Fit ---
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='#ec4899', label='Actual Salary') # Pink dots
plt.plot(X, lin_reg.predict(X), color='#3b82f6', label='Linear Fit') # Blue line
plt.title('Salary vs Level (Linear Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
# plt.show()
# --- Visualize Polynomial Fit (smoother curve) ---
# Create smoother X values for plotting the curve
X_grid = np.arange(X.min(), X.max(), 0.1)  # Use scalar min/max (min(X) on a 2D array returns an array)
X_grid = X_grid.reshape((len(X_grid), 1))
# Transform the smooth X values using the SAME poly_reg object
X_grid_poly = poly_reg.transform(X_grid)
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='#ec4899', label='Actual Salary') # Pink dots
plt.plot(X_grid, poly_lin_reg.predict(X_grid_poly), color='#14b8a6', label='Polynomial Fit (Degree 4)') # Teal curve
plt.title('Salary vs Level (Polynomial Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
# plt.show()
# 5. Predicting a new result (e.g., for level 6.5)
new_level = [[6.5]]
# Transform the new level using the fitted poly_reg
new_level_poly = poly_reg.transform(new_level)
predicted_salary = poly_lin_reg.predict(new_level_poly)
print(f"\nPredicted salary for Level 6.5: ${predicted_salary[0]:,.2f}")
A key decision in Polynomial Regression is choosing the degree of the polynomial. How do you find the right degree?
Generally, start with degree 2, then try 3 or 4, and compare how well each fits, ideally on data the model hasn't seen. Very high degrees usually lead to overfitting. A rough sketch of comparing degrees with cross-validation follows below.
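As an illustration (not part of the original script), one way to compare a few candidate degrees is cross-validation with a scikit-learn `Pipeline`; `X` and `y` below are assumed to be the arrays loaded earlier:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Score each candidate degree with k-fold cross-validation (cv=3 because the dataset is tiny)
for degree in [1, 2, 3, 4, 5]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=3, scoring='r2')
    print(f"Degree {degree}: mean R^2 = {scores.mean():.3f}")
```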
| Issue | Solution | Prevention/Best Practice |
|---|---|---|
| Linear regression fits poorly on visibly curved data. | Switch to Polynomial Regression. Start with `degree=2` and increase if needed. | Always visualize your data first with a scatter plot! |
| Model fits training data perfectly (very low error) but performs terribly on test data. | The polynomial degree is likely too high, causing overfitting. | Reduce the degree. Use cross-validation or check test set performance to find the optimal degree. Consider regularization if applicable. |
| Code throws errors about input shapes or data types. | Ensure X is a 2D array (even for one feature, use `.values.reshape(-1, 1)` or `iloc[:, 1:2]`). Ensure data is numeric. Verify column indices in `iloc`. | Inspect data shapes (`.shape`) and types (`.dtypes`) after loading and transformations. |
| Model predicts nonsensical values (e.g., negative salary). | Could be due to extreme overfitting with a high degree, or the chosen polynomial doesn't make sense outside the observed data range. | Choose a more reasonable degree. Be cautious extrapolating far beyond your training data range. |
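For the shape-related errors in the table, a quick sanity check like this generic sketch (not tied to the salary dataset) usually catches the problem early:

```python
import numpy as np

# A 1D array will make scikit-learn estimators complain about the input shape
levels = np.array([1, 2, 3, 4, 5])
print(levels.shape)        # (5,)  -> 1D, not what fit()/transform() expect for X

X = levels.reshape(-1, 1)  # one column, as many rows as needed
print(X.shape)             # (5, 1) -> 2D, ready to use
```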
In short, Polynomial Regression captures curved relationships by generating higher-order features with `PolynomialFeatures` and then fitting a standard `LinearRegression` model to these new features.
Interview Question
Question 1: When would you choose Polynomial Regression over Simple Linear Regression?
You would choose Polynomial Regression when a scatter plot of the independent variable (X) and dependent variable (Y) clearly shows a curved (non-linear) relationship, rather than a straight-line trend. Simple Linear Regression assumes a linear relationship and will fit poorly to curved data.
Question 2: What does the "degree" of a polynomial regression model signify?
The degree signifies the highest power of the independent variable (X) included in the regression equation (e.g., a degree of 3 means the equation includes terms up to X³). A higher degree allows the model to fit more complex curves.
Interview Question
Question 3: How does Scikit-learn's `PolynomialFeatures` help in implementing Polynomial Regression?
`PolynomialFeatures` automatically generates the higher-order terms (like X², X³, etc.) from the original independent variable(s) X. It transforms the input feature matrix into a new matrix containing these polynomial terms, which can then be fed into a standard `LinearRegression` model to fit the polynomial relationship.
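For instance, in a tiny standalone sketch, transforming a single-column input with degree 2 produces a bias column plus the original values and their squares:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0], [4.0]])
poly = PolynomialFeatures(degree=2)   # include_bias=True by default
print(poly.fit_transform(X))
# [[ 1.  2.  4.]
#  [ 1.  3.  9.]
#  [ 1.  4. 16.]]
```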
Question 4: What is the main risk associated with choosing a very high degree for a polynomial regression model?
The main risk is overfitting. A very high-degree polynomial can create an extremely complex curve that fits the training data points (including noise) almost perfectly but fails to generalize to new, unseen data. This results in low training error but high testing error.
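One common way to see this in practice (a generic sketch on synthetic data, not the salary example) is to compare training and test R² for a deliberately high degree; a large gap between the two is the classic overfitting signature:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset: a smooth curve plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 15).reshape(-1, 1)
y = 2 + 3 * X.ravel() ** 2 + rng.normal(scale=5, size=15)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# A very high degree tends to chase the noise in the few training points
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_train, y_train)

print("Train R^2:", model.score(X_train, y_train))
print("Test  R^2:", model.score(X_test, y_test))
```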
Interview Question
Question 5: Even though Polynomial Regression models curves, why is it still often solved using a `LinearRegression` object in libraries like Scikit-learn?
Because the model equation (y = b₀ + b₁x + b₂x² + ...) is still linear with respect to the coefficients (b₀, b₁, b₂, ...). We treat the polynomial terms (x, x², x³) as separate features. The goal is still to find the best linear combination of these features (by finding the optimal coefficients) to predict y, which is exactly what `LinearRegression` does.