Pregel: A System for Large-Scale Graph Processing
Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs (in some cases billions of vertices, trillions of edges) poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
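The iteration model described in this abstract can be illustrated with a small single-process sketch. It is not the Pregel API: the function name `pregel_max`, the dictionary-based graph, and the "propagate the maximum value" task are assumptions chosen for illustration. In each iteration a vertex reads the messages sent to it in the previous iteration, updates its state, and sends messages along its outgoing edges only if its state changed.

```python
def pregel_max(graph, values):
    """Propagate the largest vertex value to every reachable vertex.

    graph:  vertex -> list of out-neighbours (hypothetical representation)
    values: vertex -> initial numeric value
    """
    values = dict(values)
    active = set(graph)                   # every vertex starts out active
    inbox = {v: [] for v in graph}        # messages sent in the previous iteration
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}   # messages to deliver in the next iteration
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            # Send only in the first iteration or after a change;
            # otherwise the vertex goes quiet until a message wakes it.
            if superstep == 0 or changed:
                for neighbour in graph[v]:
                    outbox[neighbour].append(new_value)
        # Synchronous barrier: messages become visible only in the next iteration.
        inbox = outbox
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values, superstep


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result, steps = pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(result)
```

Because messages are exchanged only at the barrier between iterations, the order in which vertices are processed within one iteration does not affect the result.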
OPTICS: Ordering Points To Identify the Clustering Structure
data cube 2
Data Mining: Concepts and Techniques - Data Cube Technology
Data Cubes
Advanced Data Management - Data Cubes
Clustering Analysis
Data Mining: Concepts and Techniques, Clustering Analysis
Aurora: a new model and architecture for data stream management
This paper describes the basic processing model and architecture of Aurora, a new system to manage data streams for monitoring applications. Monitoring applications differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T. We first provide an overview of the basic Aurora model and architecture and then describe in detail a stream-oriented set of operators.
F1: A Distributed SQL Database That Scales
F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
Weka (http://blog.csdn.net/quantum_bit/article/details/44665555)
Source files needed when working with the Weka corpus. Details: http://blog.csdn.net/quantum_bit/article/details/44665555
Raw data (http://blog.csdn.net/quantum_bit/article/details/44665555)
Predict whether a car purchased at the auction is a good or bad buy.
The dependent variable, IsBadBuy, is binary (C2).
There are 32 independent variables (C3 to C34).
paper: Generalized Search Trees for Database Systems
This paper introduces the Generalized Search Tree (GiST), an index structure supporting an extensible set of queries and data types. The GiST allows new data types to be indexed in a manner supporting queries natural to the types; this is in contrast to previous work on tree extensibility which only supported the traditional set of equality and range predicates. In a single data structure, the GiST provides all the basic search tree logic required by a database system, thereby unifying disparate structures such as B+-trees and R-trees in a single piece of code, and opening the application of search trees to general extensibility.
To illustrate the flexibility of the GiST, we provide simple method implementations that allow it to behave like a B+-tree, an R-tree, and an RD-tree, a new index for data with set-valued attributes. We also present a preliminary performance analysis of RD-trees, which leads to discussion on the nature of tree indices and how they behave for various datasets.
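The idea of one tree skeleton specialised by a handful of key methods can be sketched as follows. This is a simplified illustration, not the paper's code: the class names `GiST`, `Node`, and `IntervalOps`, the fan-out limit, and the midpoint split are assumptions, and deletion and key compression are left out. The tree itself knows nothing about the key type. It only calls `consistent`, `union`, `penalty`, and `pick_split`, and here those are implemented for one-dimensional intervals so the tree behaves like an ordered range index.

```python
from dataclasses import dataclass, field


class IntervalOps:
    """Key methods for 1-D interval keys (lo, hi); hypothetical example type."""

    @staticmethod
    def consistent(key, query):
        # True if the subtree or entry under `key` may overlap the query interval.
        return key[0] <= query[1] and query[0] <= key[1]

    @staticmethod
    def union(keys):
        # Smallest interval covering all given keys.
        return (min(k[0] for k in keys), max(k[1] for k in keys))

    @staticmethod
    def penalty(key, new_key):
        # How much `key` must grow to cover `new_key`.
        grown = IntervalOps.union([key, new_key])
        return (grown[1] - grown[0]) - (key[1] - key[0])

    @staticmethod
    def pick_split(entries):
        # Sort by key and cut in the middle (a deliberately simple policy).
        ordered = sorted(entries, key=lambda e: e[0])
        mid = len(ordered) // 2
        return ordered[:mid], ordered[mid:]


@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)  # (key, value) or (key, child Node)


class GiST:
    def __init__(self, ops, max_entries=4):
        self.ops, self.max_entries = ops, max_entries
        self.root = Node(leaf=True)

    def search(self, query, node=None):
        node = self.root if node is None else node
        for key, item in node.entries:
            if self.ops.consistent(key, query):
                if node.leaf:
                    yield item
                else:
                    yield from self.search(query, item)

    def insert(self, key, value):
        split = self._insert(self.root, key, value)
        if split:  # the root overflowed: grow the tree by one level
            self.root = Node(False, [(self._cover(n), n) for n in split])

    def _cover(self, node):
        return self.ops.union([k for k, _ in node.entries])

    def _insert(self, node, key, value):
        if node.leaf:
            node.entries.append((key, value))
        else:
            # Descend into the child whose key needs the least enlargement.
            i = min(range(len(node.entries)),
                    key=lambda j: self.ops.penalty(node.entries[j][0], key))
            child = node.entries[i][1]
            split = self._insert(child, key, value)
            replacement = split if split else (child,)
            node.entries[i:i + 1] = [(self._cover(n), n) for n in replacement]
        if len(node.entries) > self.max_entries:
            left, right = self.ops.pick_split(node.entries)
            return Node(node.leaf, left), Node(node.leaf, right)
        return None


tree = GiST(IntervalOps)
for x in range(20):
    tree.insert((x, x), x)          # points stored as degenerate intervals
print(sorted(tree.search((5, 8))))
```

Swapping `IntervalOps` for a class whose keys are bounding rectangles would, under the same assumptions, give R-tree-like behaviour without touching the `GiST` class.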
Access Path Selection in a Relational Database Management System
In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.
Algorithm: recursion
Lecture notes from the algorithms course (CS374) at the University of Illinois at Urbana-Champaign, covering recursion. Written by Jeff Erickson.
Concurrency Control and Recovery
Introduction: Many service-oriented businesses and organizations, such as banks, airlines, catalog retailers, hospitals, etc. have grown to depend on fast, reliable, and correct access to their "mission-critical" data on a constant basis. In many cases, particularly for global enterprises, 7x24 access is required; that is, the data must be available seven days a week, twenty-four hours a day. Data Base Management Systems (DBMS) are often employed to meet these stringent performance, availability, and reliability demands. As a result, two of the core functions of a DBMS are: 1) to protect the data stored in the database and 2) to provide correct and highly available access to that data in the presence of concurrent access by large and diverse user populations, and despite various software and hardware failures. The responsibility for these functions resides in the concurrency control and recovery components of the DBMS software.
A Comparison of Approaches to Large-Scale Data Analysis
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Data Cube: A Relational Aggregation Operator
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Abstract: Data analysis applications typically aggregate data across many dimensions looking for unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional answers. Applications need the N-dimensional generalization of these operators. This paper defines that operator, called the data cube or simply cube. The cube operator generalizes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The cube treats each of the N aggregation attributes as a dimension of N-space. The aggregate of a particular set of attribute values is a point in this space. The set of points forms an N-dimensional cube. Super-aggregates are computed by aggregating the N-cube to lower dimensional spaces. Aggregation points are represented by an "infinite value", ALL. For example, the point (ALL,ALL,ALL,...,ALL, sum(*)) would represent the global sum of all items. Each ALL value actually represents the set of values contributing to that aggregation.
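The role of the ALL value can be made concrete with a small sketch. The function name `data_cube`, the column names, and the sample sales rows are invented for illustration, and the naive per-row expansion is only one simple way to compute the result, not the paper's algorithm. Each input row contributes to every cell obtained by keeping or replacing each of its N dimension values with ALL, so a row touches 2^N cells.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # placeholder standing for "every value of this dimension"


def data_cube(rows, dims, measure):
    """Sum `measure` over every combination of dimension values and ALL."""
    cube = defaultdict(int)
    for row in rows:
        # Per dimension: either the row's own value or ALL.
        choices = [(row[d], ALL) for d in dims]
        for cell in product(*choices):
            cube[cell] += row[measure]
    return dict(cube)


# Hypothetical sample data.
sales = [
    {"model": "Chevy", "year": 1994, "color": "black", "units": 50},
    {"model": "Chevy", "year": 1995, "color": "white", "units": 40},
    {"model": "Ford", "year": 1994, "color": "black", "units": 30},
]
cube = data_cube(sales, ["model", "year", "color"], "units")

print(cube[(ALL, ALL, ALL)])        # global sum: 120
print(cube[("Chevy", ALL, ALL)])    # roll-up over year and color: 90
print(cube[(ALL, 1994, "black")])   # cross-tab cell: 80
```

The cell `(ALL, ALL, ALL)` is the zero-dimensional global aggregate, cells with one concrete value correspond to one-dimensional GROUP BY results, and cells with no ALL are the ordinary N-dimensional groups.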
Mining Heterogeneous Information Networks
Mining Heterogeneous Information Networks: Principles and Methodologies.
Synthesis Lectures on Data Mining and Knowledge Discovery
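RankClus paper
A paper on a clustering algorithm for information networks, authored by Jiawei Han, an expert in the field.
SimRank paper
A paper on a clustering analysis algorithm for information networks, from Stanford University.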
Bigtable: A Distributed Storage System for Structured Data
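Bigtable is a distributed storage system for structured data that can scale to petabytes. Many Google projects store their data in Bigtable, including web search, Google Earth, and Google Finance. Different applications place different demands on Bigtable, for example in data size and latency. Despite these varied demands, Bigtable has provided a flexible, high-performance solution for these Google products. This paper describes Bigtable's data model, design, and implementation.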
R-trees: a dynamic index structure for spatial searching
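A well-known paper on database indexing, written by Antonin Guttman of the University of California, Berkeley.
Paper abstract:
In computer-aided design and spatial data applications, a database system needs an index mechanism that retrieves data quickly by spatial location in order to handle spatial data efficiently. However, traditional indexing methods are not well suited to data objects of finite size located in multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree, which meets this need, and give algorithms for searching and updating it. A series of tests shows that the structure is very efficient, and we conclude that it is suitable for current database systems in spatial data applications.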
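Searching by spatial location, as described in the abstract, can be sketched with a hand-built two-level tree. This is an illustration under stated assumptions, not the paper's algorithms: the names `RNode`, `overlaps`, `mbr`, `make_parent`, and `search` are hypothetical, rectangles are `(xmin, ymin, xmax, ymax)` tuples, and insertion, node splitting, and deletion are omitted. Each internal entry stores the minimum bounding rectangle of its child, so a window query can skip whole subtrees whose rectangle does not overlap the window.

```python
from dataclasses import dataclass


def overlaps(a, b):
    """True if rectangles a and b (xmin, ymin, xmax, ymax) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr(rects):
    """Minimum bounding rectangle of a collection of rectangles."""
    rects = list(rects)
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


@dataclass
class RNode:
    leaf: bool
    entries: list  # leaf: (rect, object id); internal: (rect, child RNode)


def make_parent(children):
    """Build an internal node whose entries bound each child."""
    return RNode(False, [(mbr(r for r, _ in c.entries), c) for c in children])


def search(node, window):
    """Return ids of all stored objects whose rectangle overlaps the window."""
    hits = []
    for rect, item in node.entries:
        if overlaps(rect, window):
            if node.leaf:
                hits.append(item)
            else:
                hits.extend(search(item, window))  # descend only into overlapping subtrees
    return hits


# Hypothetical data: two clusters of small rectangles.
west = RNode(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
east = RNode(True, [((10, 10, 11, 11), "c"), ((12, 10, 13, 12), "d")])
root = make_parent([west, east])

print(search(root, (0.5, 0.5, 2.5, 2.5)))
print(search(root, (5, 5, 6, 6)))
```

A window that falls between the two clusters overlaps neither bounding rectangle at the root, so neither leaf is visited.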
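Introduction to Google's F1 Distributed Database
Lecture notes from the Advanced Data Management course at the University of Illinois at Urbana-Champaign, introducing Google's F1 distributed database. Related paper: http://download.csdn.net/detail/quantum_bit/8556099
Taobao Software Infrastructure Practice: Embracing Open Source
Celebration of the 20th session of the Open Source Force (开源力量) open course - "Embracing Open Source: The Road to Enterprise IT Autonomy", Taobao software infrastructure practice
Phrase Mining
Lecture notes from Jiawei Han's data mining course at the University of Illinois at Urbana-Champaign, covering various phrase mining algorithms.
Data Mining: Network Concepts and Network Modeling
Lecture notes used by Jiawei Han in his data mining course at the University of Illinois at Urbana-Champaign, introducing data network concepts and network models.
Information Network Analysis
Lecture notes from Jiawei Han's data mining course at the University of Illinois at Urbana-Champaign, covering various network analysis algorithms.
r-tree
Slides on R-trees from the Advanced Data Management course at the University of Illinois at Urbana-Champaign. The lecture is based on the R-tree paper: http://download.csdn.net/detail/quantum_bit/8458419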
Naiad: A Timely Dataflow System (presentation)
Naiad: A Timely Dataflow System
Pig Latin: A Not-So-Foreign Language for Data Processing
Advanced Data Management: Aurora and Stream Processing
Advanced Data Management: NewSQL systems and F1
Advanced Data Management: MapReduce
Advanced Data Management: data cube
Implementing data cube efficiently
Transactions: Concurrency Control and Recovery: Optimistic concurrency control
On Optimistic Methods for Concurrency Control
Transactions: Concurrency Control and Recovery
Indexing: GiST
Advanced Data Management, Indexing: GiSTs
Indexing: R-Trees
CS511 Advanced Data Management, Indexing: R-Trees