
There is more information on the Internet than any one person can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it.
You need web scraping.
Web scraping automatically extracts data and presents it in a format you can easily make sense of. In this tutorial we will focus on its applications in the financial market, but web scraping can be used in a wide variety of situations.
If you are an avid investor, getting closing prices every day can be a pain, especially when the information you need is spread across several webpages. We will make data extraction easier by building a web scraper that retrieves stock indices automatically from the Internet.

Getting Started
We are going to use Python as our scraping language, together with a simple yet powerful library, BeautifulSoup.
- For Mac users, Python is pre-installed in OS X. Open Terminal and type python --version. You should see that your Python version is 2.7.x.
- For Windows users, please install Python through the official website.
Next we need to get the BeautifulSoup library using pip, a package management tool for Python.
In the terminal, type:
easy_install pip
pip install BeautifulSoup4
Note: If you fail to execute the above commands, try adding sudo in front of each line.
The Basics
Before we start jumping into the code, let's understand the basics of HTML and some rules of scraping.
HTML tags
If you already understand HTML tags, feel free to skip this part.
    <!DOCTYPE html>
    <html>
        <head>
        </head>
        <body>
            <h1> First Scraping </h1>
            <p> Hello World </p>
        </body>
    </html>
This is the basic syntax of an HTML webpage. Every <tag> serves a block inside the webpage:
1. <!DOCTYPE html>: HTML documents must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The meta and script declarations of the HTML document sit between <head> and </head>.
4. The visible part of the HTML document is between the <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.

Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns.

Also, HTML tags sometimes come with id or class attributes: an id uniquely identifies one element in the document, while a class groups elements that share the same styling. Both can help us locate the data we want. For more information on HTML tags, id and class, please refer to the W3Schools Tutorials.

Scraping Rules
1. Check a website's Terms and Conditions before you scrape it, and be careful about how the data may legally be used.
2. Do not request data from the website too aggressively with your program, as this may break the site; a reasonable pace, such as one request per page per second, is good practice.
3. The layout of a website may change from time to time, so revisit the site and update your code as needed.
Inspecting the Page

Let's take one page from the Bloomberg Quote website as an example. As someone following the stock market, we would like to get the index name (S&P 500) and its price from this page. First, right-click and open your browser's inspector to inspect the webpage.

Try hovering your cursor over the price and you should see a blue box surrounding it. If you click it, the related HTML will be selected in the browser console.

From the result, we can see that the price is nested inside a few levels of HTML tags. Similarly, if you hover over and click the name "S&P 500 Index", you can see which tags it sits inside. Now we know the unique location of our data, with the help of those tags and their class attributes.

Jump into the Code

Now that we know where our data is, we can start coding our web scraper. Open your text editor now!

First, we need to import all the libraries that we are going to use.
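A minimal sketch of those imports, assuming Python 2 (where urllib2 is available) and the BeautifulSoup4 package installed earlier:

```
# import libraries
import urllib2                 # Python 2 standard library for fetching URLs
from bs4 import BeautifulSoup  # installed above with pip install BeautifulSoup4
```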
Next, declare a variable for the url of the page.
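For example (the exact Bloomberg URL below is an assumption and may have changed since this was written):

```
# the quote page we want to scrape; this URL is an assumption
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'
```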
Then, make use of Python's urllib2 to get the HTML page for the URL we declared.
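With urllib2, that step could look like:

```
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
```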
Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.
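For instance, using Python's built-in html.parser:

```
# parse the html using BeautifulSoup and store it in the variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
```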
Now we have a variable, soup, containing the HTML of the page. Remember the unique layers of our data? BeautifulSoup can help us get into these layers and extract the content with find().
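As an illustration, if the index name sits in an h1 tag whose class is 'name' (an assumption about the page's markup at the time):

```
# take out the <h1> holding the index name; the 'name' class is an assumption
name_box = soup.find('h1', attrs={'class': 'name'})
```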
After we have the tag, we can get the data by getting its text attribute.
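Something along these lines, with strip() to drop surrounding whitespace:

```
name = name_box.text.strip()  # strip() removes leading and trailing whitespace
print(name)
```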
Similarly, we can get the price too.
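Again assuming the price sits in a div with a 'price' class:

```
# get the index price; the 'price' class is an assumption about the page's markup
price_box = soup.find('div', attrs={'class': 'price'})
price = price_box.text
print(price)
```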
When you run the program, you should be able to see that it prints out the current price of the S&P 500 Index.

Export to Excel CSV

Now that we have the data, it is time to save it. The Excel Comma Separated Format (CSV) is a nice choice. It can be opened in Excel, so you can see the data and process it easily.

But first, we have to import the Python csv module and the datetime module to get the record date. Insert these lines into your code in the import section.
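That is:

```
import csv
from datetime import datetime
```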
At the bottom of your code, add the code for writing data to a csv file.
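A sketch of that saving step, appending the name, price, and a timestamp as one row (the index.csv filename is just an example):

```
# open a csv file in append mode, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
```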
Now if you run your program, you should be able to export the data to a CSV file (index.csv in the sketch above). So if you run this program every day, you will be able to get the S&P 500 Index price easily, without rummaging through the website!

Going Further (Advanced uses)

Multiple Indices

So scraping one index is not enough for you, right? We can try to extract multiple indices at the same time. First, modify the URL variable into an array of URLs.
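For example, the S&P 500 plus a second index (the second URL is assumed to be the Nasdaq Composite quote page, and both may have changed):

```
# an array of quote pages; both URLs are assumptions
quote_page = ['http://www.bloomberg.com/quote/SPX:IND',
              'http://www.bloomberg.com/quote/CCMP:IND']
```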
Then we change the data extraction code into a for loop, which will process the URLs one by one and store all the data in a list of tuples.
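Under the assumptions above (quote_page as a list, and the 'name' and 'price' classes), the loop could look like this:

```
# loop through the quote pages and collect (name, price) tuples
data = []
for pg in quote_page:
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(pg)
    # parse the html using BeautifulSoup
    soup = BeautifulSoup(page, 'html.parser')
    # take out the index name; the 'name' class is an assumption
    name_box = soup.find('h1', attrs={'class': 'name'})
    name = name_box.text.strip()
    # take out the index price; the 'price' class is an assumption
    price_box = soup.find('div', attrs={'class': 'price'})
    price = price_box.text
    # save the data as a tuple
    data.append((name, price))
```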
Also, modify the saving section to save data row by row.
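For instance, matching the data list built above:

```
# open a csv file in append mode, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    # write one row per index
    for name, price in data:
        writer.writerow([name, price, datetime.now()])
```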
Rerun the program and you should be able to extract two indices at the same time!

Advanced Scraping Techniques

BeautifulSoup is simple and great for small-scale web scraping. But if you are interested in scraping data at a larger scale, you should consider other alternatives, such as Scrapy (a powerful Python scraping framework) or retrieving the data through a public API where one is available.
Adopt the DRY Method

DRY stands for "Don't Repeat Yourself": try to automate your repetitive everyday tasks. Some other fun projects to consider might be keeping track of your Facebook friends' active time (with their consent, of course), or grabbing a list of topics in a forum and trying out natural language processing (which is a hot topic in Artificial Intelligence right now)!

If you have any questions, please feel free to leave a comment below.

This article was originally published on Altitude Labs' blog and was written by our software engineer, Leonard Mok. Altitude Labs is a software agency that specializes in personalized, mobile-first React apps.