xingzhixi - CSDN Blog

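[Original] Chukwa Study Notes 3: Log4J

Introduction: Adding logging to an application generally serves three purposes: monitoring how variables change in the code and periodically recording them to a file so that other applications can run statistical analysis; tracing the code's runtime path as a basis for later audits; and standing in for the debugger of an integrated development environment by printing debugging information to a file or the console. I. With today's emphasis on developing reusable components, rather than writing a reusable logging class from scratch, you can use the powerful logging package that Apache provides: Log4j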

2012-06-22 10:53:01 686

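[Original] Chukwa Study Notes 2: Jetty

Jetty introduction: Jetty is an open-source servlet container that provides a runtime environment for Java-based web content such as JSP and servlets. Jetty is written in Java, and its API is distributed as a set of JAR files. Developers can instantiate the Jetty container as an object, which quickly gives stand-alone Java applications network and web connectivity. I. Feature overview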

2012-06-21 20:03:53 394

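[Original] Chukwa Study Notes: JAX-RS

JAX-RS introduction: JAX-RS (JSR-311) is a Java™ API that makes developing RESTful services in Java quick and easy. The API provides an annotation-based model for describing distributed resources. Annotations are used to supply a resource's location, its representation, and a pluggable data-binding architecture. Chukwa, the Hadoop-based log collection system, makes good use of this API. I.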

2012-06-19 19:45:02 660

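As the name suggests, LinkedHashMap is a HashMap with an additional linked-list structure. Compared with HashMap, LinkedHashMap maintains a HashMap whose entries are also chained in a doubly linked list. LinkedHashMap supports two orderings: insertion order and access order. In access order, the most recently used entry moves to the tail. For example, given M1 M2 M3 M4, using M3 changes the order to M1 M2 M4 M3. A LinkedHashMap therefore iterates over its elements in a defined order, whereas a HashMap's iteration order is not guaranteed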

2012-03-06 09:40:44 312
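
The two orderings described in the LinkedHashMap entry above can be tried with a small sketch. The class and method names here are made up for illustration; the only library feature used is the standard three-argument `LinkedHashMap` constructor, whose last argument selects access order (`true`) or insertion order (`false`).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: shows how reading M3 affects iteration order.
final class LinkedHashMapOrderDemo {

    // Inserts M1..M4, reads M3 once, and returns the resulting key order.
    static List<String> keysAfterUsingM3(boolean accessOrder) {
        // Third argument: false = insertion order, true = access order.
        Map<String, Integer> map = new LinkedHashMap<>(16, 0.75f, accessOrder);
        map.put("M1", 1);
        map.put("M2", 2);
        map.put("M3", 3);
        map.put("M4", 4);

        // "Use" M3. In access-order mode this moves M3 to the tail.
        map.get("M3");

        // Copying the key set does not count as an access.
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println("insertion order: " + keysAfterUsingM3(false)); // [M1, M2, M3, M4]
        System.out.println("access order:    " + keysAfterUsingM3(true));  // [M1, M2, M4, M3]
    }
}
```

Access order is the mode typically used when building a simple LRU cache, since the least recently used entry is always at the head of the iteration order.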

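[Repost] 100 Essential Web Development Tools

Web technology is advancing rapidly, and web designers and developers have more and more tools to choose from. A web developer's skills are no longer limited to HTML and server-side programming; they also need to master a range of third-party resources. These resources are sometimes more complex and more specialized than your own project, and you cannot implement everything yourself. With web APIs, you can easily integrate many excellent third-party resources into your own site. This article is a comprehensive collection of third-party resources that may be useful in web development. 1. Functions and libraries A. CAPT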

2012-01-27 15:02:53 615

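[Repost] Mahout Algorithm Collection

Apache Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, with the aim of helping developers build intelligent applications more quickly and easily. Recent versions of Mahout have also added support for Apache Hadoop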

2012-01-27 14:29:34 852

[Repost] new data sets

1. Climate monitoring dataset: http://cdiac.ornl.gov/ftp/ndp026b 2. Several websites for downloading useful test datasets: http://www.cs.toronto.edu/~roweis/data.html http://kdd.ics.uci.edu/summary.task.type.html

2012-01-27 14:28:58 406

[Repost] Datasets for Data Mining

Data Visualization and Exploration Sites: Google Public Data, with dynamic visualization and exploration tools. Tableau Public, free software for visualizing and sharing data. Swivel PublicD

2012-01-27 11:58:14 734

[Original] SLEPc

Website: http://www.grycap.upv.es/slepc/ SLEPc is a software library for the solution of large scale sparse eigenvalue problems on parallel computers. It is an extension of PETSc and can be used for ei

2011-11-18 10:54:56 813

[Original] Mapreduce bibliography

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 2004, pages 137-150, 2004. [2] Jeffrey Dean and Sanjay Ghemawat. Mapreduce

2011-11-16 15:52:25 956

[Original] Statistics about Hadoop and Mapreduce Algorithm Papers

Underneath are statistics about which 20 papers (of about 80 papers) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000

2011-11-16 15:51:07 537

[Original] Mapreduce & Hadoop Algorithms in Academic Papers (3rd update)

Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapr

2011-11-16 15:49:46 680

[Original] Janrain Usage Documentation

Documentation. Additional Documentation: Engage for Android - Library for Android app support. Engage for iOS - Library for native iOS app support. Provider Guide - Features supported by

2011-10-30 13:57:25 1313

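[Original] OAuth Study Notes

A brief introduction to the basic OAuth flow (using Sina Weibo as an example). The OAuth request cycle can be divided into the following four steps: OAuth offers two authentication methods: query-string and HTTP headers. We recommend authenticating with the HTTP header. Request signing: all OAuth requests use the same algorithm to generate the signature base string and the signature. The base string is built by taking the HTTP method name,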

2011-10-29 21:52:31 1071

[Original] Data Mining Winter 2010 Resources (from last year's course website):

TheFind Shopping Search Engine Dataset. Craigslist Data (data will be uploaded soon!). All Tweets and some associated metadata from June 2009. Memetracker Dataset (More than 1 million ne

2011-10-28 10:50:00 482

[Original] Advanced Topics in Data Mining Spring 2011

Books (PDFs): Mining Massive Datasets by A. Rajaraman, J. Ullman. Networks, Crowds, and Markets: Reasoning About a Highly Connected World by D. Easley, J. Kleinberg. Data-Intensive Te

2011-10-28 10:46:16 457

[Original] Proceedings of the Tenth SIAM International Conference on Data Mining

Sessions: S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20 S21. Session S1: Text Mining. 1 Text Categorization Using Word Similarities Based on Higher Order Co-occurre

2011-10-28 10:44:23 2844

[Original] GraphLab collaborative filtering library: efficient probabilistic matrix/tensor factorization on mul

Note: http://graphlab.org/pmf.html This webpage explains how to use the GraphLab collaborative filtering library. In this library, multiple matrix decomposition algorithms are implemented. See desc

2011-10-28 10:42:48 1305

[Original] Mapreduce and Mobile Algorithms.

MapReduce Algorithms: Introductory slides: http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt Talk videos: http://code.google.com/edu/submissions/mapreduce-minilect

2011-10-28 10:41:55 1350

[Original] Using your laptop to compute PageRank for millions of webpages

The PageRank algorithm is a great way of using collective intelligence to determine the importance of a webpage. There’s a big problem, though, which is that PageRank is difficult to apply to the web

2011-10-28 10:40:49 569

[Original] What are some good class projects for machine learning using MapReduce?

What are some good class projects for machine learning using MapReduce? We are looking for a (not necessarily academic) class project for a class where we are learning to implement various

2011-10-28 10:38:35 553

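[Original] Nutch Study Notes 3: Fetcher

1. Fetcher overview. In Nutch the Fetcher module has its own package, org.apache.nutch.fetcher, which consists of Fetcher.java, FetcherOutput, and FetcherOutputFormat. It looks simple, but it relies on multithreading, the multithreaded producer-consumer model, MapReduce multi-path output, and similar techniques. Let's look at the comments in Fetcher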

2011-10-24 10:45:42 411
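
The multithreaded producer-consumer model mentioned in the Fetcher entry above can be sketched in plain Java. This is not Nutch's code: the class name, the `run` method, the sentinel value, and the "fetch" step (which only records the URL) are all assumptions made for illustration. One feeder thread puts URLs into a bounded queue and several worker threads take them out.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a producer-consumer pipeline; not taken from Nutch.
final class ProducerConsumerSketch {
    // Sentinel telling a consumer that no more work will arrive.
    private static final String END = "__END__";

    static Set<String> run(List<String> urls, int consumerCount) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4); // small bounded buffer
        Set<String> fetched = ConcurrentHashMap.newKeySet();       // thread-safe result set

        // Producer: feeds every URL, then one sentinel per consumer.
        Thread producer = new Thread(() -> {
            try {
                for (String url : urls) queue.put(url);            // blocks while the queue is full
                for (int i = 0; i < consumerCount; i++) queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: take URLs until they see the sentinel.
        Thread[] consumers = new Thread[consumerCount];
        for (int i = 0; i < consumerCount; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();                 // blocks while the queue is empty
                        if (url.equals(END)) break;
                        fetched.add(url);                          // stand-in for a real fetch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }
        producer.start();

        try {
            producer.join();
            for (Thread c : consumers) c.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://a.example/", "http://b.example/", "http://c.example/");
        System.out.println(run(urls, 2).size() + " URLs processed");
    }
}
```

The bounded queue provides back-pressure: the feeder blocks when the workers fall behind, so memory use stays flat regardless of how long the URL list is.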

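[Original] Accessing Files on HDFS through the Java API

1. Configure the core-site.xml configuration file. Property hadoop.tmp.dir: the directory on the NameNode where metadata is stored; on a DataNode, it is the directory on that node where file data is stored. Property fs.default.name: the NameNode's IP address and port; the default value is file:///. The Java API must use the URL configured here to connect to HDFS; as for DataNodes, a DataNode uses this UR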

2011-10-23 20:13:20 959

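[Original] Nutch Study Notes 2: The Generate Process

1. What Generate does. Generate comes after inject. Its main job is to produce, from the CrawlDb, a set of URLs that the Fetcher can crawl (the fetchlist). Nutch 1.3 supports producing fetchlists for multiple segments in a single Generate run, and IP addresses are resolved only for the URLs that are about to be fetched. Within a segment, all URLs are, by IP, domain, or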

2011-10-23 15:57:47 516

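[Original] Nutch Study Notes 1: Inject

1. Inject overview. In Nutch, Inject is used to inject a text-format list of URLs into the crawl database, typically to bootstrap the system. Each line of the text-format URL list contains one URL. Inject also reserves two metadata keys. nutch.score: allows setting the score of a specific URL. nutch.fetchInterval: the fetch interval of a specific URL, in milliseconds. e.g.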

2011-10-23 11:34:03 722

xingzhixi's Column