众_奴 - CSDN Blog

There may be many reasons that brought you here. It could be because you heard all about Hadoop and what it can do to crunch petabytes of data in a reasonable amount of time. While reading into Hadoop, you found that for random access to the accumulated data there is something called HBase. Or it was the hype that is prevalent these days around a new kind of data storage architecture, one that strives to solve large-scale data problems where traditional solutions may be either too involved or cost-prohibitive. A common term used in this area is NoSQL.

No matter how you arrived here, I presume you want to know and learn, like me not too long ago, how you can use HBase in your company or organization to store a virtually endless amount of data. You may have a background in relational database theory, or you want to start fresh and this "column-oriented thing" is something that seems to fit your bill. You also heard that HBase can scale without much effort, and that alone is reason enough to look at it, since you are building the next web-scale system.

I was at that point in late 2007, facing the task of storing millions of documents in a system that needed to be fault tolerant and scalable while still being maintainable by just me. I had decent skills in managing a MySQL database system and was using it to store data that would ultimately be served to our website users. This database was running on a single server, with another as a backup. The issue was that it would not be able to hold the amount of data I needed to store for this new project. I would either have to invest in serious RDBMS scalability skills or find something else instead. Obviously I went the latter route, and since my mantra always was (and still is) "How does someone like Google do it?", I came across Hadoop.
After a few attempts at using Hadoop directly, I was faced with implementing a random access layer on top of it. But that problem had been solved already: in 2006 Google had published a paper called BigTable [1], and the Hadoop developers had an open-source implementation of it called HBase (the Hadoop Database). That was the answer to all my problems. Or so it seemed...

What follows is a blur to me. Looking back, I wish this customer project could have started today. HBase is now mature, nearing a 1.0 release, and is used by many high-profile companies, such as Facebook, Adobe, Twitter, and StumbleUpon. Mine was one of the very first clusters in production (and is still in use today!), and my use case triggered a few very interesting issues (let me refrain from saying more). But that was to be expected when betting on a 0.1x version of a community project. Over the years I had the opportunity to contribute back and stay close to the development team, so that eventually I was humbled by being asked to become a full-time committer as well.

I learned a lot over the last few years from my fellow HBase developers and am still learning more every day. My belief is that we are nowhere near the peak of this technology and that it will evolve further over the years to come. Let me pay my respect to the entire HBase community with this book, which strives to cover not just the internal workings of HBase or how to get it going, but more specifically how to apply it to your use case. In fact, I strongly assume that this is why you are here right now. You want to learn how HBase can solve your problem. Let me help you figure this out.

2012-09-05

Hadoop: The Definitive Guide, Chinese edition (first three chapters)

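Table of contents

Meet Hadoop: 1.1 Data! Data!; 1.2 Data Storage and Analysis; 1.3 Comparison with Other Systems; 1.4 A Brief History of Hadoop; 1.5 The Apache Hadoop Project
Introduction to MapReduce: 2.1 A Weather Dataset; 2.2 Analyzing the Data with Unix Tools; 2.3 Analyzing the Data with Hadoop; 2.4 Scaling Out; 2.5 Hadoop Streaming; 2.6 Hadoop Pipes
The Hadoop Distributed Filesystem: 3.1 The Design of HDFS; 3.2 HDFS Concepts; 3.3 The Command-Line Interface; 3.4 Hadoop Filesystems; 3.5 The Java Interface; 3.6 Data Flow; 3.7 Parallel Copying with distcp; 3.8 Hadoop Archives
Hadoop I/O: 4.1 Data Integrity; 4.2 Compression; 4.3 Serialization; 4.4 File-Based Data Structures
Developing a MapReduce Application: 5.1 The Configuration API; 5.2 Configuring the Development Environment; 5.3 Writing a Unit Test; 5.4 Running Locally on Test Data; 5.5 Running on a Cluster; 5.6 Tuning a Job; 5.7 MapReduce Workflows
How MapReduce Works: 6.1 Running a MapReduce Job; 6.2 Failures; 6.3 Job Scheduling; 6.4 Shuffle and Sort; 6.6 Task Execution
MapReduce Types and Formats: 7.1 MapReduce Types; 7.3 Output Formats
MapReduce Features: 8.1 Counters; 8.2 Sorting; 8.3 Joins; 8.4 Side Data Distribution; 8.5 MapReduce Library Classes
Setting Up a Hadoop Cluster: 9.1 Cluster Specification; 9.2 Cluster Setup and Installation; 9.3 SSH Configuration; 9.4 Hadoop Configuration; 9.5 Post-Install; 9.6 Benchmarking a Hadoop Cluster; 9.7 Hadoop in the Cloud
Administering Hadoop: 10.1 HDFS; 10.2 Monitoring; 10.3 Maintenance
Introduction to Pig: 11.1 Installing and Running Pig; 11.2 An Example; 11.3 Comparison with Databases; 11.4 Pig Latin; 11.5 User-Defined Functions; 11.6 Data Processing Operators; 11.7 Pig in Practice: Tips and Tricks
Introduction to HBase: 12.1 HBase Basics; 12.2 Concepts; 12.3 Installation; 12.4 Clients; 12.5 Example; 12.6 HBase Versus RDBMS; 12.7 Practice
Introduction to ZooKeeper: 13.1 Installing and Running ZooKeeper; 13.2 An Example; 13.3 The ZooKeeper Service; 13.4 Building Applications with ZooKeeper; 13.5 ZooKeeper in Industry
Case Studies: 14.1 Hadoop Usage at Last.fm; 14.2 Hadoop and Hive at Facebook; 14.3 Hadoop in the Nutch Search Engine; 14.4 Hadoop for Log Processing at Rackspace; 14.5 The Cascading Project; 14.6 The 1 TB Sort on Apache Hadoop
Installing Apache Hadoop
Cloudera's Distribution for Hadoop
Preparing the NCDC Weather Data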

2012-09-05

lzh8189146's Column

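Postman Chrome plugin

Scala programming

Programming in Scala (Chinese edition)

Search engine: Solr environment setup, tokenization, and indexing operations

Solr distributed deployment

Lucene_in_Action (Chinese edition)

Complete Guide to JS Regular Expressions [6]

Complete Guide to JS Regular Expressions [5]

Complete Guide to JS Regular Expressions [4]

Complete Guide to JS Regular Expressions [3]

Complete Guide to JS Regular Expressions [2]

Complete Guide to JS Regular Expressions [1]

HBase official documentation (Chinese translation)

HBase: The Definitive Guide (English edition)

Hadoop: The Definitive Guide, Chinese edition (first three chapters)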
