Distributed Machine Learning (4)-Implement Your MapReduce
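When MapReduce comes up, Hadoop MapReduce naturally comes to mind, but MapReduce is only a programming paradigm, and Hadoop MapReduce is one well-known implementation of it. In fact, MapReduce can be implemented in many ways. This article shows how to implement a MapReduce program in bash on Linux, in both a single-machine version and a multi-machine version. The original video is here (requires VPN).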
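To make the point that MapReduce is a paradigm rather than a particular framework, here is a minimal single-process word count sketch. It is written in Python for readability and is not the bash implementation the post describes; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(sample))
```

In a shell pipeline the same three roles are typically played by a mapper command, `sort` (which brings equal keys together), and a reducer command; a multi-machine version mainly changes where each stage runs, not the shape of the computation.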
This article mainly introduces a common technology behind several important Internet businesses (online advertising, recommendation systems, search engines): semantic understanding, and various methods to implement it, including matrix factorization, topic models, etc. The original video is here (requires VPN).
This lecture mainly introduces the classic method for mining frequent itemsets: FP-growth, and how to implement this algorithm through MapReduce. The MapReduce-implemented FP-growth is also called PFP, which can mine not only frequent itemsets but also infrequent itemsets. The original video is here (requires VPN).
This distributed machine learning series is based on talks given by Wang Yi. As the speaker notes, distributed machine learning differs considerably from the machine learning we commonly hear about today, so many of the views in the talks run counter to what we learn from textbooks. The speaker has rich experience in this area; although the talks are three years old and some of the technologies mentioned may have changed, some of the views are still worth referencing.
I have my own doubts about some of these views. I record them here as the speaker expressed them; perhaps only after I start working will I have the chance to verify whether they are correct.
This article mainly introduces some important concepts in distributed machine learning: real Internet data follows a long-tail distribution, “big is more important than fast,” and not blindly applying a framework. The corresponding video is here (requires VPN).
This article introduces stacking, an ensemble learning method in machine learning. First, I explain the idea behind stacking, then provide an implementation approach that can easily extend the base models in stacking.
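As a rough illustration of the stacking idea, the sketch below builds out-of-fold predictions from a list of base models; those predictions would then serve as input features for a second-level model. This is an assumption-laden sketch and not the implementation from the post: `MeanModel`, `ThresholdModel`, and `out_of_fold_features` are hypothetical names, and the two base models are deliberately trivial. Extending the ensemble amounts to appending another model class to the list.

```python
class MeanModel:
    """Hypothetical base model: always predicts the mean training label."""

    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean for _ in X]


class ThresholdModel:
    """Hypothetical base model: predicts 1.0 when the first feature
    exceeds the mean of that feature in the training data."""

    def fit(self, X, y):
        self.cut = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1.0 if row[0] > self.cut else 0.0 for row in X]


def out_of_fold_features(model_factories, X, y, n_folds=3):
    """Return one row per sample and one column per base model, where each
    entry is predicted by a model that never saw that sample in training."""
    n = len(X)
    meta = [[0.0] * len(model_factories) for _ in range(n)]
    for fold in range(n_folds):
        # Simple interleaved split: sample i belongs to fold i % n_folds.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        for j, make_model in enumerate(model_factories):
            preds = make_model().fit(X_train, y_train).predict(X_test)
            for i, p in zip(test_idx, preds):
                meta[i][j] = p
    return meta


if __name__ == "__main__":
    X = [[0], [1], [2], [3], [4], [5]]
    y = [0, 0, 0, 1, 1, 1]
    # Adding a base model means adding one more entry to this list.
    meta_X = out_of_fold_features([MeanModel, ThresholdModel], X, y)
    print(meta_X)
```

The second-level model is then trained on `meta_X` with the original labels; using out-of-fold predictions keeps it from learning on outputs the base models produced for data they had already seen.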
This article describes how to model action sequences using word2vec and CNN/RNN. I tested this approach in a recent competition, where it proved reasonably effective, reaching 0.87 accuracy on a binary classification task. The article walks through the specific steps of the method, illustrated with the competition and code.