Data Frame • Computer Engineer

บทที่ 2: Data Preparation

2.1 ไลบรารี่สำคัญ

NumPy: ไลบรารี่สำหรับการคำนวณเชิงคณิตศาสตร์ มีประสิทธิภาพสูงในการจัดการกับอาเรย์และเมตริกซ์
Scikit-learn: ไลบรารี่สำหรับ Machine Learning ที่รองรับการแบ่งประเภทข้อมูล การแบ่งกลุ่ม และการวิเคราะห์การถดถอย
Pandas: ไลบรารี่สำหรับการจัดการและวิเคราะห์ข้อมูลแบบมิติเดียวและหลายมิติ
Seaborn และ Matplotlib: สำหรับการสร้างกราฟและการวิเคราะห์ข้อมูลด้วยภาพ

2.2 การใช้งาน DataFrame

DataFrame คือข้อมูล 2 มิติที่จัดเก็บในรูปแบบตาราง ประกอบด้วยแถวและคอลัมน์
การอ่านและเขียนไฟล์ CSV โดยใช้ Pandas
- pd.read_csv('filename.csv') สำหรับการอ่านข้อมูล
- df.to_csv('filename.csv') สำหรับการเขียนข้อมูล

Code

import pandas as pd
data = [['Alice',21,'F'],
        ['Bob',32,'M'],
        ['Carol',27,'F'],
        ['David',24,'M']]

cols = ['Name','Age','Gender']
df = pd.DataFrame(data,columns = cols)
print('df =\n',df)

# out to file
# df.to_csv('testfile.csv',index=False,columns=cols)
# df2 = pd.read_csv('testfile.csv')
# print('df2 =\n',df2)

2.3 การ Plot Graph

การใช้ matplotlib.pyplot สำหรับสร้างกราฟ
การใช้ Seaborn สำหรับการสร้างกราฟที่มีความสัมพันธ์ เช่น sns.lmplot

Code 1

import numpy as np
import matplotlib.pyplot as plt
wide = [1.0, 3, 2, 2.2, 1.2, 0.9, 2.9, 1.9, 0.85, 2.1, 3.1, 3.3, 1.1, 2.8, 2]
long = [15, 30, 3, 2.1, 14, 13, 33, 3.5, 16, 2.6, 28, 32, 12, 29, 4]

x = np.array(wide)
y = np.array(long)

x_2 = [1.0,6.0]
y_2 = [1.0,30]

plt.scatter(x,y)
plt.plot(x_2,y_2,'b')
plt.grid()
plt.xlabel('wide')
plt.ylabel('long')
plt.show()

Code 2

import numpy as np
import matplotlib.pyplot as plt

# Data
n_high = np.array([170, 165, 172, 161, 180])
n_weight = np.array([75, 80, 64, 68, 82])

c_high = np.array([120, 125, 112, 118, 106])
c_weight = np.array([44, 48, 51, 41, 57])

# Normal data points
plt.scatter(n_weight,n_high, color='blue')

# Custom data points
plt.scatter(c_weight,c_high, color='green')

# plot
plt.axvline(x = 60.5,color='g')
plt.axhline(y = 143,color='r')

# Add titles and labels
plt.title('Height vs Weight Comparison')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.grid(True)

# Show plot
plt.show()

2.4 Regression Analysis

การสร้างโมเดลด้วย LinearRegression จาก Scikit-learn
การพยากรณ์ค่าจากโมเดล

บทที่ 3: Linear Regression

3.1 ความหมายของ Regression

การวิเคราะห์ความสัมพันธ์ระหว่างตัวแปรอินพุต (Predictor) และตัวแปรผลลัพธ์ (Response)
ใช้สำหรับการพยากรณ์ (Prediction)

3.2 Mathematical Model

สมการพื้นฐาน: ( Y = F(X) ) เช่น ( Y = 67.45 + 0.214 times X )
ใช้ Gradient Descent ในการหาค่าพารามิเตอร์ ( a, b )

3.3 การวัดประสิทธิภาพโมเดล (Model Evaluation)

การใช้ R-Squared เพื่อประเมินความแม่นยำของโมเดล
- สูตร: ( R^2 = 1 - frac{text{Sum of Residual Errors}}{text{Total Sum of Squares}} )

3.4 ตัวอย่างและแบบฝึกหัด

ใช้ Python คำนวณค่าพารามิเตอร์ ( a, b ) และประเมินโมเดล
วาดกราฟความสัมพันธ์ระหว่างตัวแปร
การคาดการณ์ผลลัพธ์ เช่น การขายสินค้าเมื่อปริมาณฝนตกแตกต่างกัน

Code

import numpy as np

a = 1.31
b = 0.65
alpha = 0.1

x= np.array([0,1,2,3])
y= np.array([1,3,5,7])
h = a*x + b

c = np.power((y - h),2) * 0.5

out1 = -(y-h)
out2 = -(y-h)*x
print('h          = ',h)
print('se         = ',c)
print('-(y-h)     = ',out1)
print('-((y-h)*x) = ',out2)

# sum
Sumout1 = np.sum(out1)
Sumout2 = np.sum(out2)
print('sum out1   = ',Sumout1)
print('sum out2   = ',Sumout2)

# a,b new
anew = a - alpha * Sumout2
bnew = b - alpha * Sumout1

print('anew       =   {:.2f}'.format(anew))
print('bnew       =   {:.2f}'.format(bnew))

Output

h          =  [0.65 1.96 3.27 4.58]
se         =  [0.06125 0.5408  1.49645 2.9282 ]
-(y-h)     =  [-0.35 -1.04 -1.73 -2.42]
-((y-h)*x) =  [-0.   -1.04 -3.46 -7.26]
sum out1   =  -5.54
sum out2   =  -11.76
anew       =  2.49
bnew       =  1.20