PiterPy 2019 / Andrey Gavrilov: Move from Pandas to Spark. Adaptation of machine learning models to work in a di... / Saint Petersburg, Russia / 1 November 2019

Andrey Gavrilov: Move from Pandas to Spark. Adaptation of machine learning models to work in a distributed environment

Description

Data Science and Big Data are two of the most frequent buzzwords in the field of working with data. Data Science is mainly devoted to analyzing information and making data-driven decisions, while Big Data is about processing large volumes of data and bringing them into a manageable form. Because the two areas are adjacent, the number of tasks at their intersection keeps growing. In other words, engineers are increasingly faced with the challenge of operationalizing ML models. The objective often consists of adapting ML models to work in a distributed environment.
The talk presents approaches to replacing implementations of machine learning algorithms with distributed analogues. In particular, Word2vec, a group of related models used to produce word embeddings, is compared in its Gensim implementation with the analogue from the distributed machine learning library MLlib (PySpark). A comparative analysis of the results of singular value decomposition (SVD) is carried out for the implementations in PySpark MLlib and Scikit-learn (TruncatedSVD). The issues of distributed training (in an HDInsight cluster) of neural networks implemented with the Keras library (TensorFlow) are also considered.
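
One practical detail when comparing SVD results from two implementations is that singular vectors are only determined up to sign (assuming distinct singular values), so two correct libraries can return vectors that point in opposite directions. The abstract does not say how the comparison was done; the following is only a minimal sketch of one possible check, assuming the component vectors from each library have already been exported as plain lists of rows. The function names and the sample values are hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align_signs(reference, candidate):
    """Flip every candidate vector that points opposite to its reference vector."""
    aligned = []
    for ref_vec, cand_vec in zip(reference, candidate):
        sign = -1.0 if cosine(ref_vec, cand_vec) < 0 else 1.0
        aligned.append([sign * x for x in cand_vec])
    return aligned


def max_abs_difference(reference, candidate):
    """Largest element-wise gap between the two sets after sign alignment."""
    aligned = align_signs(reference, candidate)
    return max(
        abs(a - b)
        for ref_vec, cand_vec in zip(reference, aligned)
        for a, b in zip(ref_vec, cand_vec)
    )


# Hypothetical component vectors (one row per component), not real results.
# The first row of the second set has its sign flipped on purpose.
sklearn_components = [[0.6, 0.8], [-0.8, 0.6]]
spark_components = [[-0.6, -0.8], [-0.8, 0.6]]

gap = max_abs_difference(sklearn_components, spark_components)
print(gap)
```

A gap close to zero after alignment suggests the two decompositions agree; comparing the raw vectors without alignment would report a large difference even though both results are valid.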

Key Words: Data Science, Big Data, Python, Spark, PySpark, MLlib, Word2vec, Scikit-learn, Keras, TensorFlow, Neural networks, SVD

Andrey Gavrilov
St. Petersburg, Russia
Big Data Software Engineer
EPAM

I work with Big Data and Data Science at EPAM. I studied Data Science at Peter the Great St. Petersburg Polytechnic University, in the Department of Applied Mathematics. I am interested in Python game development as well as information security.
