Good day, friends.
I set myself the task of compiling a dictionary of Habrahabr in order to track the emergence of new languages, frameworks, management practices, and so on. In short, new words.
The result is a list of English words "in the nominative case and singular."
The work was done on Windows 10 x64, using Python 3 in the Spyder editor from Anaconda 5.1.0, over a wired network connection.
In this article I build a dictionary of English words on a limited sample. If the topic turns out to be interesting, then in the future I plan to build a dictionary of both English and Russian words on the full set of Habr's articles. The Russian language is more complicated.
Parsing process
I took the basic skeleton of the parser from here. Just below is the code for my version of the parser.
To collect the Habr dictionary, you need to crawl its articles and extract the article text from them. I did not process the meta information of the articles. Articles on Habr have their own "number", for example https://habr.com/post/346198/. The numbers can be iterated from 0 to 354366, which was the latest article at the time of the project.
For each number we try to fetch an html page and, when that succeeds, extract the title and the text of the article from the html structure. The crawl code is:
import pandas as pd
import requests
from bs4 import BeautifulSoup

dataset = pd.DataFrame()

for pid in range(350000, 354366):
    # Fetch the page for the current article number
    r = requests.get('https://habrahabr.ru/post/' + str(pid) + '/')
    soup = BeautifulSoup(r.text, 'html5lib')
    # The title span is only present when the article actually exists
    if soup.find("span", {"class": "post__title-text"}):
        title = soup.find("span", {"class": "post__title-text"}).text
        text = soup.find("div", {"class": "post__text"}).text
        my_series = pd.Series([pid, title, text], index=['id', 'title', 'text'])
        df_new = pd.DataFrame(my_series).transpose()
        dataset = dataset.append(df_new, ignore_index=True)
Empirically I found that there are roughly three times fewer actual articles than numbers. I practiced on 4366 numbers; my system downloads this amount in about half an hour.
I did not work on speed optimization, although people say that if you run the processing in 100 threads it goes much faster.
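I have not tried that myself, but a minimal sketch of such a multi-threaded crawl could look roughly like this (the fetch_post helper and the worker count of 100 are my assumptions, not code from the project):

import pandas as pd
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def fetch_post(pid):
    # Fetch one article; return None when the page has no post title (no such article)
    r = requests.get('https://habrahabr.ru/post/' + str(pid) + '/')
    soup = BeautifulSoup(r.text, 'html5lib')
    title_tag = soup.find("span", {"class": "post__title-text"})
    if not title_tag:
        return None
    text = soup.find("div", {"class": "post__text"}).text
    return {'id': pid, 'title': title_tag.text, 'text': text}

# 100 workers is an arbitrary guess; the site may throttle an aggressive crawl
with ThreadPoolExecutor(max_workers=100) as pool:
    rows = [row for row in pool.map(fetch_post, range(350000, 354366)) if row]

dataset = pd.DataFrame(rows, columns=['id', 'title', 'text'])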
I saved the result to disk
dataset.to_excel(directory+'dataset.xlsx', sheet_name='sheet1', index=False)
- so as not to repeat the slow download from the Internet. The resulting file is about 10 megabytes.
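On subsequent runs, the saved file can simply be read back instead of crawling again; a minimal sketch, assuming the same directory variable as above:

import pandas as pd
# Reload the previously saved articles instead of downloading them again
dataset = pd.read_excel(directory + 'dataset.xlsx', sheet_name='sheet1')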
I was interested in the English names of tools. I did not need the terms in their various inflected forms; I wanted to get the normal forms of words right away. Obviously, words like "in", "on", and "and" occur most often, so we remove such stopwords. To normalize the dictionary, I used the English Porter stemmer from the nltk library.
To build the actual list of dictionary words, I used a slightly roundabout method; see the code starting from "from sklearn.feature_extraction.text import CountVectorizer". I will need this later.
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(0, len(dataset.index)):
    # Keep only Latin letters, lowercase the text, split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])
    review = review.lower()
    review = review.split()
    # Stem each word and drop English stopwords
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
names = cv.get_feature_names()
dfnames = pd.DataFrame(names).transpose()
dfnames.to_excel(directory+'names.xlsx', sheet_name='sheet1', index=False)
The names object is the desired dictionary. We saved it to disk.
Results Review
The result is more than 30 thousand normalized words. And that is from only 4,366 article numbers, with English words only.
Some interesting observations:
The authors of the articles use many strange "words", for example: aaaaaaaaaaa, aaaabbbbccccdddd or zzzhoditqxfpqbcwr
- From the object X we get the Top 10 most popular English words in our sample (a sketch of how to compute this appears after the table):
Word      Count
iter      4133
op        4030
return    2866
ns        2834
id        2740
name      2556
new       2410
data      2381
string    2358
http      2304
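For reference, a minimal sketch of getting this Top 10 from the X matrix and the names list produced above (my reconstruction, not necessarily the exact code I ran):

import pandas as pd
# Sum each word's occurrences over all articles and take the ten largest totals
word_counts = pd.Series(X.sum(axis=0), index=names)
top10 = word_counts.sort_values(ascending=False).head(10)
print(top10)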