How to get started with Machine Learning in about 10 minutes

With the rise of Machine Learning in the industry, the need for a tool that can help you iterate through the process quickly has become vital. Python, a rising star in Machine Learning technology, is often the first choice to bring you success. So, a guide to Machine Learning with Python is really necessary.

An introduction to Machine Learning with Python

So, why Python? In my experience, Python is one of the easiest programming languages to learn. The process needs to be iterated on quickly, and a data scientist does not need to know the language deeply, because it can be picked up very quickly.

How easy is it?

for anything in the_list:
    print(anything)

That easy. The syntax is closely related to English (or human language, not machine language). And there are no silly curly braces to confuse people. I have a colleague who works in Quality Assurance, not a software engineer, and she can write production-level Python code within a day. (Really!)

So, the builders of the libraries we will discuss below chose Python as their language of choice. And as data analysts and scientists, we can just use their masterpieces to help us complete our tasks. These are the incredible libraries that are a must for Machine Learning with Python.

1. Numpy

The famous numerical analysis library. It will help you with many things, from computing the median of a data distribution to processing multidimensional arrays.
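For example, here is a quick sketch (with made-up numbers) of the kind of operations NumPy makes easy:

import numpy as np

# A small 2-D array (matrix) of made-up numbers
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(np.median(data))    # median of all values
print(data.mean(axis=0))  # column-wise mean
print(data.T.shape)       # shape of the transposed multidimensional array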

2. Pandas

For processing CSV files. Of course, you will need to process some tables and see statistics, and this is the right tool to use.
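A tiny sketch of how that usually looks (the file name "data.csv" is just a placeholder):

import pandas as pd

# Load a CSV file into a DataFrame
df = pd.read_csv("data.csv")

print(df.head())      # first five rows
print(df.describe())  # basic statistics per numeric column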

3. Matplotlib

Once you have the data stored in Pandas DataFrames, you may need some visualizations to understand more about the data. Images are still better than thousands of words.
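Something as simple as this (with made-up values) already tells you a lot:

import matplotlib.pyplot as plt

# A quick line plot of some made-up values
values = [1, 4, 2, 8, 5]
plt.plot(values)
plt.xlabel("index")
plt.ylabel("value")
plt.title("A quick look at the data")
plt.show()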

4. Seaborn

This is another visualization tool, but more focused on statistical visualization. Things like histograms, pie charts, curves, or maybe correlation tables.
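Here is a small sketch using one of Seaborn's example datasets (it assumes a reasonably recent Seaborn for histplot, and the dataset is fetched online the first time you load it):

import seaborn as sns
import matplotlib.pyplot as plt

# Load one of Seaborn's example datasets
tips = sns.load_dataset("tips")

# Histogram of a single column
sns.histplot(tips["total_bill"])
plt.show()

# Correlation table of the numeric columns
sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True)
plt.show()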

5. Scikit-Learn

This is the final boss of Machine Learning with Python. THE SO-CALLED Machine Learning with Python is this guy. Scikit-Learn. Everything you need, from algorithms to improvements, is here.
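Just to give you a feel for the workflow before we touch the Titanic data, here is a minimal sketch using scikit-learn's built-in iris toy dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a classifier and evaluate it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))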

6. TensorFlow and PyTorch

I won't talk too much about these two. But if you are interested in Deep Learning, take a look at them, it will be worth your time. (I will give another tutorial about Deep Learning next time, stay tuned!)

Python Machine Learning Projects

Of course, reading and studying alone will not get you where you need to go. You need actual practice. As I said on my blog, learning the tools is pointless if you do not jump into the data. So let me introduce you to a place where you can easily find Python Machine Learning projects.

Kaggle is a platform where you can dive directly into the data. You will solve projects and get really good at Machine Learning. Something that might get you more interested in it: the Machine Learning competitions it holds can give prizes of up to $100,000. And you might want to try your luck. Haha.

But the most important thing is not the money - it is really a place where you can find Machine Learning with Python projects. There are lots of projects you can try. But if you are a newcomer, and I assume you are, you will want to join this competition.

Here is an example of the project we will use in the tutorial below:

Titanic: Machine Learning from Disaster

Yes, the infamous Titanic. A tragic disaster in 1912 that took the lives of 1,502 of the 2,224 passengers and crew. This Kaggle competition (or I can say tutorial) gives you the real data about the disaster. And your task is to explain the data so that you can predict whether a person survived the incident or not.

Machine Learning with Python Tutorial

Before we go deep into the Titanic data, let's install some of the tools you need.

Of course, Python. You need to install it first from the official Python website. You need to install version 3.6+ to stay up to date with the libraries.

After that, you need to install all the libraries via Python pip. Pip should be installed automatically with the Python distribution you just downloaded.

Then install the things you need via pip. Open your terminal, command line, or Powershell, and write the following:

pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn
pip install jupyter

Well, everything looks good. But wait, what is jupyter? Jupyter is an abbreviation of Julia, Python, and R, hence Jupytr. But it was an odd combination of words, so they changed it to just Jupyter. It is a famous notebook where you can write Python code interactively.

Just type jupyter notebook in your terminal and it will open a browser page like this:

Write the code inside the green rectangle, and you can write and evaluate Python code interactively.

Now you have installed all the tools. Let's get going!

Data Exploration

The first step is to explore the data. You need to download the data from the Titanic page on Kaggle. Then put the extracted data in the folder where you start your Jupyter notebook.

Then import the necessary libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Then load the data:

train_df = pd.read_csv("train.csv")
train_df.head()

You will see something like this:

That is our data. It has the following columns:

  1. PassengerId, the identifier of the passenger
  2. Survived, whether the passenger survived or not
  3. Pclass, the ticket class: 1 is first class, 2 is second, and 3 is third
  4. Name, the passenger's name
  5. Sex
  6. Age
  7. SibSp, or siblings and spouses, the number of siblings and spouses aboard
  8. Parch, or parents and children, the number of them aboard
  9. Ticket, the ticket detail
  10. Cabin, their cabin. NaN means unknown
  11. Embarked, the origin of embarkation, S for Southampton, Q for Queenstown, C for Cherbourg

While exploring data, we often find missing data. Let’s see them:

def missingdata(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)
    ms = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    ms = ms[ms["Percent"] > 0]
    f, ax = plt.subplots(figsize=(8, 6))
    plt.xticks(rotation='90')
    # Keyword arguments (x=, y=) are required by newer Seaborn versions
    fig = sns.barplot(x=ms.index, y=ms["Percent"], color="green", alpha=0.8)
    plt.xlabel('Features', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by feature', fontsize=15)
    return ms

missingdata(train_df)

We will see a result like this:

The Cabin, Age, and Embarked columns have some missing values, and the Cabin information is largely missing. We need to do something about them. This is what we call Data Cleaning.

Data Cleaning

This is what takes up 90% of our time. We will do a lot of Data Cleaning for every single Machine Learning project. When the data is clean, we can easily jump ahead to the next step without worrying about anything.

The most common technique in Data Cleaning is filling in missing data. You can fill the missing data with the mode, mean, or median. There is no absolute rule for this choice; you can try them one after another and compare the performance. But as a rule of thumb, use the mode only for categorical data, and use the median or mean for continuous data.

So let’s fill the embarkation data with Mode and the Age data with median.

train_df['Embarked'].fillna(train_df['Embarked'].mode()[0], inplace=True)
train_df['Age'].fillna(train_df['Age'].median(), inplace=True)

The next important technique is to just remove the data, especially for largely missing data. Let’s do it for the cabin data.

drop_column = ['Cabin']
train_df.drop(drop_column, axis=1, inplace=True)

Now we can check the data we have cleaned.

print('check the nan value in train data')
print(train_df.isnull().sum())

Perfect! No missing data found. That means the data has been cleaned.

Feature Engineering

Now we have cleaned the data. The next thing we can do is Feature Engineering.

Feature Engineering is basically the technique of deriving new features from the data that is currently available. There are several ways to do it. More often than not, it is about common sense.

Let's take a look at the Embarked data: it is filled with Q, S, or C. The machine learning library will not be able to process these letters, since it can only handle numbers. So you need to do something called One Hot Vectorization, turning the one column into three columns. Let's say Embarked_Q, Embarked_S, and Embarked_C, which are filled with 0 or 1 depending on whether the person embarked from that harbor or not.
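As a small illustration (the complete feature engineering code comes a bit later in this tutorial), pandas can do this one-hot encoding with get_dummies:

# Illustrative only: one-hot encode just the Embarked column
embarked_dummies = pd.get_dummies(train_df['Embarked'], prefix='Embarked')
print(embarked_dummies.head())  # Embarked_C, Embarked_Q, Embarked_S columns of 0/1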

The other example is SibSp and Parch. Maybe there is nothing interesting in either of those columns on its own, but you might want to know how big the family of each passenger who boarded the ship was. You might assume that if the family was bigger, the chance of survival would increase, since they could help each other. On the other hand, people who boarded alone would have had it hard.

So you want to create another column called FamilySize, which consists of SibSp + Parch + 1 (the passenger themself).

The last example is what we call bin columns. It is a technique that creates ranges of values to group several things together, since you assume it is hard to differentiate things with similar values. For example, for Age: is there any significant difference between a person aged 5 and one aged 6? Or between a person aged 45 and one aged 46?

That's why we create bin columns. For age, we will create 4 bins: Children (0–14 years), Teenager (14–20), Adult (20–40), and Elders (40+).

Let’s code them:

# Keep the DataFrame in a list so the loops below iterate over DataFrames
all_data = [train_df]

for dataset in all_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

import re

# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Create a new feature Title, containing the titles of passenger names
for dataset in all_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# Group all non-common titles into one single grouping "Rare"
for dataset in all_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# Bin the Age column into four groups
for dataset in all_data:
    dataset['Age_bin'] = pd.cut(dataset['Age'], bins=[0, 14, 20, 40, 120],
                                labels=['Children', 'Teenage', 'Adult', 'Elder'])

# Bin the Fare column into four groups
for dataset in all_data:
    dataset['Fare_bin'] = pd.cut(dataset['Fare'], bins=[0, 7.91, 14.45, 31, 120],
                                 labels=['Low_fare', 'median_fare', 'Average_fare', 'high_fare'])

traindf = train_df

# Drop the raw columns we no longer need
drop_column = ['Age', 'Fare', 'Name', 'Ticket']
traindf.drop(drop_column, axis=1, inplace=True)

drop_column = ['PassengerId']
traindf.drop(drop_column, axis=1, inplace=True)

# One-hot encode the categorical columns
traindf = pd.get_dummies(traindf, columns=["Sex", "Title", "Age_bin", "Embarked", "Fare_bin"],
                         prefix=["Sex", "Title", "Age_type", "Em_type", "Fare_type"])

Now, you have finished all the features. Let's take a look at the correlation between the features:

sns.heatmap(traindf.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)  # data.corr() --> correlation matrix
fig = plt.gcf()
fig.set_size_inches(20, 12)
plt.show()

A correlation value of 1 means highly positively correlated, and -1 means highly negatively correlated. For example, Sex_male and Sex_female will correlate negatively, since passengers had to identify as one sex or the other. Apart from that, you can see that nothing relates strongly to anything else except the features created through feature engineering. That means we are good to go.

What will happen if something correlates strongly with something else? We can eliminate one of the two, since adding another column gives the system no new information, because both are exactly the same.
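If you ever need to hunt such pairs down, here is a common sketch (the 0.9 threshold is an arbitrary choice, not part of this tutorial's pipeline):

import numpy as np

# Absolute correlation matrix, keeping only the upper triangle to avoid duplicate pairs
corr = traindf.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Columns that are almost perfectly correlated with some other column
to_drop = [column for column in upper.columns if (upper[column] > 0.9).any()]
print(to_drop)  # candidates you might consider dropping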

Machine Learning with Python

Now we have arrived at the peak of this tutorial: Machine Learning modeling.

from sklearn.model_selection import train_test_split  # to split the data
from sklearn.metrics import accuracy_score  # for accuracy_score
from sklearn.model_selection import KFold  # for K-fold cross validation
from sklearn.model_selection import cross_val_score  # score evaluation
from sklearn.model_selection import cross_val_predict  # prediction
from sklearn.metrics import confusion_matrix  # for confusion matrix

all_features = traindf.drop("Survived", axis=1)
Targeted_feature = traindf["Survived"]
X_train, X_test, y_train, y_test = train_test_split(all_features, Targeted_feature,
                                                    test_size=0.3, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

You can choose from the many algorithms included in the scikit-learn library:

  1. Logistic Regression
  2. Random Forest
  3. SVM
  4. K Nearest Neighbors
  5. Naive Bayes
  6. Decision Trees
  7. AdaBoost
  8. LDA
  9. Gradient Boosting

You might be overwhelmed trying to figure out what is what. Don't worry, just treat it as a black box: choose the one with the best performance. (I will create a whole article on these algorithms later.)

Let's try it with my favorite one: the Random Forest Algorithm.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='sqrt',  # 'auto' in older scikit-learn, equivalent for classifiers
                               oob_score=True, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)
prediction_rm = model.predict(X_test)
print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the Random Forest Classifier is',
      round(accuracy_score(prediction_rm, y_test) * 100, 2))

kfold = KFold(n_splits=10, shuffle=True, random_state=22)  # k=10, split the data into 10 equal parts
result_rm = cross_val_score(model, all_features, Targeted_feature, cv=10, scoring='accuracy')
print('The cross validated score for Random Forest Classifier is:',
      round(result_rm.mean() * 100, 2))

y_pred = cross_val_predict(model, all_features, Targeted_feature, cv=10)
sns.heatmap(confusion_matrix(Targeted_feature, y_pred), annot=True, fmt='3.0f', cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)

Wow! It gives us 83% accuracy. That’s good enough for our first time.

The cross validated score comes from the K-Fold Validation method. If K = 10, it means you split the data into 10 folds, train and evaluate on the 10 different variations, and compute the mean of all the scores as the final score.

Fine Tuning

Now you are done with the steps in Machine Learning with Python. But there is one more step that can bring you better results: fine tuning. Fine tuning means finding the best parameters for the Machine Learning algorithm. If you look at the code for the random forest above:

model = RandomForestClassifier(criterion='gini', n_estimators=700,
                               min_samples_split=10, min_samples_leaf=1,
                               max_features='sqrt', oob_score=True,
                               random_state=1, n_jobs=-1)

There are many parameters you need to set. The values above are not the library defaults, by the way, just one hand-picked configuration. And you can change the parameters however you want. But of course, it will take a lot of time.

Don’t worry — there is a tool called Grid Search, which finds the optimal parameters automatically. Sounds great, right?

from sklearn.model_selection import GridSearchCV

# Random Forest Classifier parameter tuning
model = RandomForestClassifier()
n_estim = range(100, 1000, 100)

# Search grid for optimal parameters
param_grid = {"n_estimators": n_estim}

model_rf = GridSearchCV(model, param_grid=param_grid, cv=5,
                        scoring="accuracy", n_jobs=4, verbose=1)

model_rf.fit(X_train, y_train)  # fit on the training split from earlier

# Best score
print(model_rf.best_score_)

# Best estimator
model_rf.best_estimator_

Well, you can try it out for yourself. And have fun with Machine Learning.

Conclusion

How was it? It doesn't seem very difficult, does it? Machine Learning with Python is easy. Everything has been laid out for you. You can just do the magic. And bring happiness to people.

This piece was originally released on my blog at thedatamage.com