Pro Spark Streaming
This book shows how to develop applications with Spark Streaming and covers a number of best practices. Suitable for data scientists, big data professionals, BI analysts, and data architects.
Cloudera-Hive
Hive data warehouse software enables reading, writing, and managing large datasets in distributed storage. Using the Hive query language (HiveQL), which is very similar to SQL, queries are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark.
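As a hedged local illustration of the kind of SQL-like query HiveQL expresses, the sketch below uses Python's built-in sqlite3 as a stand-in (SQLite is not Hive, and the table and column names are made up); on a real cluster, Hive would compile such a query into MapReduce or Spark jobs.

```python
import sqlite3

# Local stand-in for a HiveQL-style aggregation; SQLite, not Hive.
# Table and column names are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("docs", 45), ("home", 80)],
)
# In Hive, a query like this would be converted into a series of
# MapReduce or Spark jobs executing across the cluster.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('docs', 45), ('home', 200)]
```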
Cloudera-Spark
Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects.
Advanced Analytics with Spark
Ever since we started the Spark project at Berkeley, I’ve been excited about not just building fast parallel systems, but helping more and more people make use of large-scale computing. This is why I’m very happy to see this book, written by four experts in data science, on advanced analytics with Spark. Sandy, Uri, Sean, and Josh have been working with Spark for a while, and have put together a great collection of content with equal parts explanations and examples.
The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets. It’s hard to find one, let alone ten examples that cover big data and that you can run on your laptop, but the authors have managed to create such a collection and set everything up so you can run them in Spark. Moreover, the authors cover not just the core algorithms, but the intricacies of data preparation and model tuning that are needed to really get good results. You should be able to take the concepts in these examples and directly apply them to your own problems. Big data processing is undoubtedly one of the most exciting areas in computing today, and remains an area of fast evolution and introduction of new ideas. I hope that this book helps you get started in this exciting new field.
How to Make Mistakes in Python
Cloudera Impala
Cloudera Impala is an open source project that is opening up the Apache Hadoop software stack to a wide audience of database analysts, users, and developers. The Impala massively parallel processing (MPP) engine makes SQL queries of Hadoop data simple enough to be accessible to analysts familiar with SQL and to users of business intelligence tools, and it’s fast enough to be used for interactive exploration and experimentation.
Spark 2.0 for Beginners
Develop large-scale distributed data processing applications using Spark 2 in Scala and Python
Python高手之路 (The Hacker's Guide to Python)
This is not a conventional introductory Python book. It contains neither a tour of Python keywords and for loops nor an exhaustive walkthrough of the standard library. Instead, it takes a thoroughly practical approach, giving a systematic and complete introduction to the knowledge needed to build a full Python application. What makes it especially valuable is that the author is a PTL (Project Technical Lead) of the open source OpenStack project, so the book illustrates its lessons with Python's use in OpenStack and offers real practical guidance.
Spark Cookbook
Over 60 recipes on Spark, covering Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX libraries
Apress Text Analytics with Python
A Practical Real-World Approach to Gaining Actionable Insights from Your Data
Python Parallel Programming Cookbook
Master efficient parallel programming to build powerful applications using Python
Spark for Python Developers
Spark for Python Developers aims to combine the elegance and flexibility of Python with the power and versatility of Apache Spark. Spark is written in Scala and runs on the Java virtual machine. It is nevertheless polyglot and offers bindings and APIs for Java, Scala, Python, and R. Python is a well-designed language with an extensive set of specialized libraries. This book looks at PySpark within the PyData ecosystem. Some of the prominent PyData libraries include Pandas, Blaze, Scikit-Learn, Matplotlib, Seaborn, and Bokeh. These libraries are open source, and are developed, used, and maintained by the community of data scientists and Python developers. PySpark integrates well with the PyData ecosystem, as endorsed by the Anaconda Python distribution. The book puts forward a journey to build data-intensive apps along with an architectural blueprint that covers the following steps: first, set up the base infrastructure with Spark. Second, acquire, collect, process, and store the data. Third, gain insights from the collected data. Fourth, stream live data and process it in real time. Finally, visualize the information.
Apache Kudu (incubating) User Guide
Apache Kudu (incubating) is a columnar storage manager developed for the Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.
Impala Guide
Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.
Spark for Data Science
Machine Learning in Python: Essential Techniques for Predictive Analysis
Hadoop with Python
Hadoop is mostly written in Java, but that doesn't exclude the use of other programming languages with this distributed storage and processing framework, particularly Python. With this concise book, you'll learn how to use Python with the Hadoop Distributed File System (HDFS), MapReduce, the Apache Pig platform and Pig Latin script, and the Apache Spark cluster-computing framework.
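The MapReduce pattern the book applies to Hadoop can be sketched locally in pure Python, with no cluster required; on a real cluster the map and reduce functions would run as distributed tasks (for example via Hadoop Streaming), and the framework would handle the shuffle step between them.

```python
from collections import defaultdict

# Local sketch of the MapReduce word-count pattern; no Hadoop needed.
def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Reduce: sum the counts per key. On a cluster, the shuffle/sort
    # step would group pairs by key before reducers run.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop with Python", "Python on Hadoop"]
counts = reduce_phase(map_phase(lines))
print(counts["python"])  # 2
```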
Spark for Data Science
Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0
The Hitchhiker's Guide to Python
A complete, clear copy of The Hitchhiker's Guide to Python.
SQL Cookbook (Chinese edition, clear scan with bookmarks)
Many people use SQL in a slapdash way, never realizing what a powerful weapon they are holding. The aim of this book is to broaden readers' horizons and show what SQL is really capable of.
Python High Performance, 2nd Edition
Second edition with table of contents and bookmarks; clear English copy, no watermarks.
Feature Engineering for Machine Learning
Feature Engineering for Machine Learning Principles and Techniques for Data Scientists
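One of the most common techniques such a book covers is standardization (z-scoring) of a numeric feature; a minimal standard-library sketch, with made-up example values:

```python
import math

def standardize(values):
    """Rescale a feature to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]

# Example feature column (hypothetical data): mean 4.0, std sqrt(8/3).
scaled = standardize([2.0, 4.0, 6.0])
print(scaled)  # approximately [-1.2247, 0.0, 1.2247]
```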
Machine Learning for the Web Explore the web and make smarter predictions
Data science, and machine learning in particular, are emerging as leading topics in the commercial tech environment for evaluating the ever-increasing amount of data generated by users. This book will explain how to use Python to develop a commercial web application using Django, and how to employ some specific libraries (sklearn, scipy, nltk, Django, and some others) to manipulate and analyze, through machine learning techniques, data that is generated or used in the application.
Python Data Science Handbook
The Jupyter notebooks accompanying this book are available on GitHub: https://github.com/jakevdp/PythonDataScienceHandbook
Python for Data Analysis 2nd
The latest English edition of Python for Data Analysis. Very helpful for getting started with data analysis.
Deep Learning with Python
One of the four canonical deep learning books; clear and accessible, and a good starting point for deep learning enthusiasts. Recommended for readers who prefer the English original.
Hands On Machine Learning with Scikit Learn and TensorFlow
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
RedHat 6.5 Baidu Cloud download link
RedHat 9 Baidu Cloud download link
RedHat 7.3 Baidu Cloud download link
RedHat 7.0 Baidu Cloud download link
Flask Framework Cookbook
Over 80 hands-on recipes to help you create small-to-large web applications using Flask
Web Development with Django Cookbook, 2nd Edition
Over 90 practical recipes to help you create scalable websites using the Django 1.8 framework.
Functional Python Programming
Create succinct and expressive implementations with functional programming in Python
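A small sketch of the functional style such a book advocates: building a pipeline from pure functions and higher-order helpers instead of mutating state in a loop (the compose() helper is illustrative, not from the book).

```python
from functools import reduce

def compose(*funcs):
    """Right-to-left function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(funcs), x)

# Build a tiny pipeline: double a number, then increment it.
double_then_inc = compose(lambda n: n + 1, lambda n: n * 2)

# Apply it lazily over a range and fold the results into a sum:
# sum of 2*x + 1 for x in 0..3 -> 1 + 3 + 5 + 7.
total = sum(map(double_then_inc, range(4)))
print(total)  # 16
```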
Fast Data Processing with Spark 2 Third Edition
Learn how to use Spark to process big data at speed and scale for sharper analytics. Put the principles into practice for faster, slicker big data projects
Cloudera Data Management
This guide describes how to perform data management using Cloudera Navigator. Data management activities include auditing access to data residing in HDFS and Hive metastores, reviewing and updating metadata, and discovering the lineage of data objects.