Quantum_bit - CSDN Blog

Pregel: A System for Large-Scale Graph Processing

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs (in some cases billions of vertices, trillions of edges) poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

2015-04-17
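
The vertex-centric iteration described in the Pregel abstract above can be illustrated with a small single-machine sketch. This is not Pregel's API: the function name `pregel_max`, the dictionary-based graph, and the choice of problem (propagating the maximum value to every vertex) are assumptions made for illustration only. Each pass of the loop plays the role of one iteration: a vertex reads the messages sent to it in the previous iteration, updates its own state, and sends messages along its outgoing edges only if its state changed.

```python
from collections import defaultdict


def pregel_max(graph, values):
    """Propagate the largest vertex value through the graph.

    graph:  dict mapping each vertex to a list of its out-neighbors
            (hypothetical representation, chosen for this sketch)
    values: dict mapping each vertex to its initial numeric value
    Returns (final values, number of iterations executed).
    """
    values = dict(values)  # work on a copy of the vertex states

    # Iteration 0: every vertex sends its value along its outgoing edges.
    inbox = defaultdict(list)
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    iterations = 1

    # Later iterations: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # State changed: update it and notify the out-neighbors.
                values[vertex] = best
                for neighbor in graph.get(vertex, []):
                    outbox[neighbor].append(best)
            # Otherwise the vertex stays silent, which ends its activity
            # until a new message arrives.
        inbox = outbox  # messages become visible in the next iteration
        iterations += 1

    return values, iterations


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, _ = pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final_values)
```

Messages are buffered in `outbox` and swapped in only at the end of each pass, which is what gives the sketch the synchronous behavior the abstract credits with making programs easier to reason about. The computation stops when no vertex sends a message.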

OPTICS: Ordering Points To Identify the Clustering Structure

2015-04-11

data cube 2

Data Mining: Concepts and Techniques - Data Cube Technology

2015-04-10

Data Cubes

Advanced Data Management - Data Cubes

2015-04-10

Clustering Analysis

Data Mining: Concepts and Techniques, Clustering Analysis

2015-04-08

Aurora: a new model and architecture for data stream management

This paper describes the basic processing model and architecture of Aurora, a new system to manage data streams for monitoring applications. Monitoring applications differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T. We first provide an overview of the basic Aurora model and architecture and then describe in detail a stream-oriented set of operators.

2015-04-08

F1: A Distributed SQL Database That Scales

F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.

2015-04-02

Weka (http://blog.csdn.net/quantum_bit/article/details/44665555)

Source files needed when using the Weka corpus. Details: http://blog.csdn.net/quantum_bit/article/details/44665555

2015-03-28

Raw data (http://blog.csdn.net/quantum_bit/article/details/44665555)

Predict whether a car purchased at auction is a good or bad buy. The dependent variable, IsBadBuy, is binary (C2). There are 32 independent variables (C3 to C34).

2015-03-27

paper: Generalized Search Trees for Database Systems

This paper introduces the Generalized Search Tree (GiST), an index structure supporting an extensible set of queries and data types. The GiST allows new data types to be indexed in a manner supporting queries natural to the types; this is in contrast to previous work on tree extensibility which only supported the traditional set of equality and range predicates. In a single data structure, the GiST provides all the basic search tree logic required by a database system, thereby unifying disparate structures such as B+-trees and R-trees in a single piece of code, and opening the application of search trees to general extensibility. To illustrate the flexibility of the GiST, we provide simple method implementations that allow it to behave like a B+-tree, an R-tree, and an RD-tree, a new index for data with set-valued attributes. We also present a preliminary performance analysis of RD-trees, which leads to discussion on the nature of tree indices and how they behave for various datasets.

2015-03-26

Access Path Selection in a Relational Database Management System

In a high level query and data manipulation language such as SQL, requests are stated non-procedurally, without reference to access paths. This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of desired data as a boolean expression of predicates. System R is an experimental database management system developed to carry out research on the relational model of data. System R was designed and built by members of the IBM San Jose Research Laboratory.

2015-03-16

Algorithm: recursion

Lecture notes for the algorithms course (CS374) at the University of Illinois at Urbana-Champaign, mainly covering recursion, written by Jeff Erickson.

2015-03-10

Concurrency Control and Recovery

Introduction: Many service-oriented businesses and organizations, such as banks, airlines, catalog retailers, hospitals, etc., have grown to depend on fast, reliable, and correct access to their "mission-critical" data on a constant basis. In many cases, particularly for global enterprises, 7x24 access is required; that is, the data must be available seven days a week, twenty-four hours a day. Data Base Management Systems (DBMS) are often employed to meet these stringent performance, availability, and reliability demands. As a result, two of the core functions of a DBMS are: 1) to protect the data stored in the database and 2) to provide correct and highly available access to that data in the presence of concurrent access by large and diverse user populations, and despite various software and hardware failures. The responsibility for these functions resides in the concurrency control and recovery components of the DBMS software.

2015-03-09

A Comparison of Approaches to Large-Scale Data Analysis

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.

2015-03-08

Data Cube: A Relational Aggregation Operator

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Abstract: Data analysis applications typically aggregate data across many dimensions looking for unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional answers. Applications need the N-dimensional generalization of these operators. This paper defines that operator, called the data cube or simply cube. The cube operator generalizes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The cube treats each of the N aggregation attributes as a dimension of N-space. The aggregate of a particular set of attribute values is a point in this space. The set of points forms an N-dimensional cube. Super-aggregates are computed by aggregating the N-cube to lower dimensional spaces. Aggregation points are represented by an "infinite value", ALL. For example, the point (ALL,ALL,ALL,...,ALL, sum(*)) would represent the global sum of all items. Each ALL value actually represents the set of values contributing to that aggregation.

2015-03-05
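
The cube operator summarized in the Data Cube abstract above can be sketched in a few lines of Python. This is a naive illustration, not the paper's algorithm or SQL syntax: the function name `cube`, the `ALL` marker string, and the sample `sales` rows are all assumptions invented for the example. The sketch sums a measure over every subset of the dimensions and writes `ALL` in each dimension that has been aggregated away, so the key `(ALL, ALL)` holds the global total.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a dimension that has been rolled up


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`.

    rows:    list of dicts, one per fact row (hypothetical layout)
    dims:    list of dimension column names
    measure: name of the numeric column to sum
    Returns a dict keyed by one value (or ALL) per dimension.
    """
    totals = defaultdict(int)
    for kept in range(len(dims) + 1):
        # Each combination is one GROUP BY over the dimensions in `group`.
        for group in combinations(dims, kept):
            for row in rows:
                key = tuple(row[d] if d in group else ALL for d in dims)
                totals[key] += row[measure]
    return dict(totals)


# Made-up sample data, for illustration only.
sales = [
    {"model": "Chevy", "year": 1994, "units": 50},
    {"model": "Chevy", "year": 1995, "units": 85},
    {"model": "Ford", "year": 1994, "units": 60},
]

result = cube(sales, ["model", "year"], "units")
for key in sorted(result, key=str):
    print(key, result[key])
```

With N dimensions the sketch performs 2^N group-bys and scans the rows once for each, which is acceptable for a demonstration but is the cost that practical implementations try to avoid by computing coarser aggregates from finer ones.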

Mining Heterogeneous Information Networks

Mining Heterogeneous Information Networks: Principles and Methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery

2015-03-03

RankClus paper

A paper on a clustering analysis algorithm for information networks. The author is Jiawei Han, an expert in the field.

2015-03-02

SimRank paper

A paper on a clustering analysis algorithm for information networks, from Stanford University.

2015-03-02

Bigtable: A Distributed Storage System for Structured Data

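Bigtable is a distributed storage system for structured data that can scale to petabytes. Many Google projects store their data in Bigtable, including web indexing, Google Earth, and Google Finance. Different applications place different demands on Bigtable, for example in data size and latency. Whatever the requirements, Bigtable has provided a flexible, high-performance solution for Google products. This paper describes Bigtable's data model, design, and implementation.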

2015-03-01

R-trees: a dynamic index structure for spatial searching

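A well-known paper on database indexing, written by Antonin Guttman of the University of California, Berkeley. Abstract: In computer-aided design and spatial data applications, a database system needs an index mechanism that retrieves data quickly according to spatial location in order to handle spatial data efficiently. However, traditional indexing methods are not well suited to data objects of finite size in multi-dimensional space. In this paper we design a dynamic index structure called an R-tree that meets this need, together with algorithms for searching and updating it. A series of tests shows that the structure is highly efficient, and we conclude that it is suitable for current database systems in spatial applications.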

2015-02-26
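
The search side of the R-tree entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not Guttman's published pseudocode: nodes are plain dictionaries, rectangles are `(xmin, ymin, xmax, ymax)` tuples, the tree is built by hand, and the names `intersects` and `search` are invented for the example. Insertion, node splitting, and deletion are not shown. The point it illustrates is that each internal entry stores a bounding rectangle for its subtree, so a subtree is skipped whenever that rectangle does not intersect the query.

```python
def intersects(a, b):
    """True if rectangles a and b overlap or touch.

    Rectangles are (xmin, ymin, xmax, ymax) tuples.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, query):
    """Return the ids of all leaf entries whose rectangle intersects `query`.

    node: {"leaf": bool, "entries": [(rect, child_or_id), ...]}
          (hypothetical node layout chosen for this sketch)
    """
    hits = []
    for rect, child in node["entries"]:
        if not intersects(rect, query):
            continue  # bounding rectangle misses the query: prune this entry
        if node["leaf"]:
            hits.append(child)  # child is an object id at the leaf level
        else:
            hits.extend(search(child, query))  # descend into the subtree
    return hits


# Hand-built two-level tree with made-up rectangles.
leaf1 = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")]}
leaf2 = {"leaf": True, "entries": [((10, 10, 11, 11), "c"), ((12, 10, 13, 12), "d")]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf1), ((10, 10, 13, 12), leaf2)]}

print(search(root, (2.5, 2.5, 10.5, 10.5)))
```

Because bounding rectangles of sibling entries may overlap, a query can descend into more than one subtree, which is why the function collects results from every intersecting entry instead of following a single path.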