Introduction to Parallel Computing

This is the main page of a graduate and senior undergraduate level course in parallel computing taught in 2017 (counted as semester 2016/2 due to a series of work stoppages) at the Polytechnic Institute [http://pt.wikipedia.org/wiki/IPRJ IPRJ]/UERJ. It is generally useful for advanced programmers in the fields of scientific and multimedia programming.
[[Imagem:Matrix6.png|right|550px]]
 
* Links to the course pages for previous years: '''[[PP2015|2015]]''' '''[[PP2012|2012]]'''


== General Info ==
* Instructor: prof. [http://rfabbri.github.io Ricardo Fabbri], Ph.D. Brown University
* Meeting times: Tuesdays 2:20pm-4pm and Thursdays 8:50am-10:30am, room 205 (to be confirmed)
* Evaluation criteria: one quiz at midterm (60%) plus practical projects (40%)
* Forum for file exchange and discussion: email and IRC #labmacambira


=== Pre-requisites ===
* Linux - intermediate to advanced (will be reviewed as needed) - read [[Literatura recomendada pela equipe|Recommended Reading]]
* C/C++ - intermediate to advanced (will be reviewed) - read [[Literatura recomendada pela equipe|Recommended Reading]]
** [[Configuring Ubuntu for Programming]]
** [[C|C language review videos and exercises by prof. Fabbri]]
* Basic understanding of computer architecture - read "Computer Systems: A Programmer's Perspective", listed in the [[Literatura recomendada pela equipe|Recommended Reading]].


== Approximate Content ==


The course focuses on software techniques for parallel computing. We aim at a comprehensive treatment of the different types of practical parallel programming techniques.

'''Part I'''
* process-oriented parallel programming and concurrency
* thread programming/thread safety (see the sketch at the end of this section)
* single-core vector instructions
* multi-processor and multi-core programming


'''Part II'''
* High-level parallel programming paradigms and language constructs: ''the focus of the course''
** Recent technology that makes parallel programming easier, more scalable, and more widely useful
** How languages such as Lua, Go, Scilab, and others support parallel concepts in powerful ways
* For interested students, we'll also cover the following:
** mapreduce/hadoop
** MPI
** Cuda/OpenCL
** Dataflow techniques
 
Each of the above techniques would merit a course of its own, as is done at many of the best universities. We therefore aim at practicing fundamentals in the first half of the course, and in the later part we specialize towards the students' interests and graduate research.
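
To ground the Part I topics, here is a minimal sketch of thread programming and thread safety in C with POSIX threads; the names (<code>worker</code>, <code>NTHREADS</code>) are illustrative, not from the course materials.

<pre>
/* threads_demo.c -- NTHREADS workers increment a shared counter;
   a mutex makes the increment thread-safe.
   Build: gcc -pthread threads_demo.c && ./a.out */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define INCS     100000

static long total = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < INCS; i++) {
        pthread_mutex_lock(&lock);   /* critical section: one thread at a time */
        total++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("total = %ld (expected %d)\n", total, NTHREADS * INCS);
    return 0;
}
</pre>

Deleting the mutex calls makes the data race visible: the final total typically falls short of the expected value, which is precisely the motivation for the thread-safety material.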


== Main Resources ==
** 1st part of the course: [http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html "Is Parallel Programming Hard, and, if so, what can you do about it?" - Paul E. McKenney / IBM (editor)]
** For MPI: "An Introduction to Parallel Programming", by Peter Pacheco [http://books.google.com.br/books?id=SEmfraJjvfwC&printsec=frontcover&hl=pt-BR&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false]
** For Cuda/OpenCL: Programming Massively Parallel Processors: A Hands-On Approach [http://books.google.com.br/books?id=qW1mncii_6EC&printsec=frontcover&hl=pt-BR#v=onepage&q&f=false]
** General textbook: A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, Addison-Wesley, 2003.
** Algorithms-oriented textbook: Algorithms: sequential, parallel, and distributed, Kenneth A. Berman, Jerome L. Paul
** http://en.wikipedia.org/wiki/Hadoop
* Rice lecture notes on Parallel Computing [http://www.clear.rice.edu/comp422/lecture-notes/index.html]
* Raspberry Pi & parallel computing (cheap and useful for building mobile clusters, education, learning mobile parallel programming, IoT + parallel computing, prototyping the construction of larger clusters, etc.)
** http://www.networkworld.com/article/3156748/computers/10-amazing-raspberry-pi-clusters.html#slide2
** Learn to build and program for clusters with MPI, Python and Raspberry Pi (a minimal MPI example is sketched after these links):
*** https://www.southampton.ac.uk/~sjc/raspberrypi/raspberry_pi_iridis_lego_supercomputer_paper_cox_Jun2013.pdf
*** http://www.meccanismocomplesso.org/en/cluster-e-programmazione-in-parallelo-con-mpi-e-raspberry-pi/
*** http://thundaxsoftware.blogspot.com.br/2016/07/creating-raspberry-pi-3-cluster.html
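
The tutorials above all start from the same kind of "hello world"; as a hedged reference point, here is a minimal MPI example in C, assuming an MPI implementation such as MPICH or Open MPI is installed (file name illustrative):

<pre>
/* hello_mpi.c -- minimal MPI example: each rank reports itself.
   Build and run: mpicc hello_mpi.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* who am I?             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many of us?       */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
</pre>

On a Raspberry Pi cluster the same binary is launched across nodes with a hostfile, e.g. <code>mpirun -np 16 -hostfile hosts ./a.out</code> (flag names as in Open MPI).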
* Other Resources
** http://www-users.cs.umn.edu/~karypis/parbook/
** http://www.cse.iitd.ernet.in/~subodh/courses/CSL860/
 
** Running Hadoop On Ubuntu Linux Multi and Single Node Cluster [http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/]
** Short course on CUDA at [http://webmail.iprj.uerj.br/xvemc/ emc 2012] conference: http://josericardojunior.com/?p=219


=== Lectures ===
Partial listing & Tentative Outline
# Overview of parallel computing: https://computing.llnl.gov/tutorials/parallel_comp/
# Review of Linux:
## Do the [[Learning the Unix Programming Environment]] tutorial to brush up on your skills
## See the book Running Linux from the [[Literatura_recomendada_pela_equipe|Recommended Reading]]
# Review of C/C++ (finished 28/Aug)
## [[C|C language review videos and exercises by prof. Fabbri]]
# Fundamental programming techniques: processes and threads
## Read The Unix Programming Environment for some classic multi-process programming [http://wiki.nosdigitais.teia.org.br/Literatura_recomendada_pela_equipe]
## What is a process: http://linuxgazette.net/133/saha.html
## 30/Aug: We'll be following the McKenney book, ch. 3, at the lab.
## Up to (and including) 18/Sept: We followed McKenney's code samples for counting algorithms and his threading API in detail.
### What is variable alignment: [http://www.songho.ca/misc/alignment/dataalign.html] [http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Type-Attributes.html]
### How to determine the cache line size for your machine: [http://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size]. This will give you better results for threading algorithms (see the sketch after this outline).
### Install some documentation useful for POSIX threading: <code>sudo apt-get install glibc-doc manpages-posix-dev</code>
## 20/Sept: Finished statistic counters and showed how to read the assembly language generated by GCC, and how to disassemble binary objects.
## 25/Sept: Finished the theoretical thread content for the course: up until ch. 5 of McKenney's book + semaphores + thread safety.
### Homework 2 (see below)
# Cuda/OpenCL (02/Oct)
## Project 1 handout (see below) due Oct. 23
# Mapreduce/Hadoop
# MPI
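
As a companion to the variable-alignment and cache-line items in the outline above, here is a small sketch that queries the L1 data cache line size and the number of online processors. It assumes Linux/glibc, where these <code>sysconf</code> names are available; the file name is illustrative.

<pre>
/* cacheline.c -- query machine parameters relevant to threading algorithms.
   _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension (it may report 0 on some
   systems; fall back to /sys/devices/system/cpu/ in that case).
   Build: gcc cacheline.c && ./a.out */
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    long line  = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("L1 data cache line: %ld bytes\n", line);  /* typically 64 */
    printf("online CPUs: %ld\n", ncpus);
    return 0;
}
</pre>

Aligning per-thread data to the reported line size, e.g. with GCC's <code>__attribute__((aligned(64)))</code>, keeps hot variables of different threads on different cache lines and avoids false sharing.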


== Homework ==
'''We only accept documents in PDF or another open format such as .odt.'''
 
For programming assignments, the student must submit the source code together with documentation inside a single folder, compressed as .zip, .tar.gz, or .tar.bz2, and labeled with the student's name, the course, and the date.


All electronically submitted material must be handed in with the string ''[parallel]'' in the subject of the email. Expect to receive an automatic confirmation.

=== Homework: system specs (due 27Abr17) ===
Make a drawing of a compute node (such as your computer) within a cluster environment, within an internet environment, and show:
* Multiple CPUs, cores, cache levels, GPUs, the interconnects/buses, the hard drive, SSD, and network adaptor, with a range of typical transfer rates between them using DMA and similar technology. Show the typical bandwidth of each interconnect and memory type.
* The typical parameters of each component (clock, typical number of cores, number of threads, typical amount of storage).
* The more parameters/detail you add, the more points you'll get for this assignment.
* In particular, you should answer whether each of these direct connections exists and what its typical transfer speed is:
** CUDA-CUDA on the same node
** CUDA-CUDA across different nodes
** The same for Xeon Phi


=== Homework: Unix tutorial (half due 27Abr17, half due 4Mai17) ===
Go over the [[LUPE]] tutorial and send me the command history (~/.bash_history) as a text file. Edit the text file to include the output of your actions.


=== Homework: Shell + GNU Parallel (due 4Mai17) ===
Go over [http://www.gnu.org/software/parallel/parallel_tutorial.html Parallel's tutorial] and report your findings. You'll receive random questions about it in class.


=== Homework: IPRJ Cluster system specs (due 26Jun17) ===
* Update the info on our cluster environment described [https://drive.google.com/drive/u/0/folders/0B8Z1cZSIN8qyQUVGdzhpS0dUZGM here].
* Confirm the info, update it where necessary, and hand it in.


=== Homework: Compile a basic example of each technology (due 26Jun17) ===
* Log onto the university cluster and compile a simple example of:
** MPI
** Dataflow
** Cuda
** OpenMP
** POSIX threads
** fork/join with shared memory
** Any other technology you are interested in.
Pick at least 5 technologies and show your results to the instructor in the next class by logging onto the cluster and running the basic examples (a minimal starting point is sketched below).
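
As a hedged starting point for one of the technologies above, here is a minimal OpenMP example in C; it is a generic sketch, not code taken from the course or the cluster.

<pre>
/* hello_omp.c -- minimal OpenMP example: each thread in the
   parallel region reports its id.
   Build: gcc -fopenmp hello_omp.c && ./a.out */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel             /* fork a team of threads */
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;                        /* implicit join at the end of the region */
}
</pre>

Each of the other technologies has an analogous starting point (e.g. <code>MPI_Init</code>/<code>MPI_Comm_rank</code> for MPI, a kernel launched with <code><<<blocks, threads>>></code> for Cuda), and getting one of each to compile and run on the cluster is the point of the assignment.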


=== Homework: limit counters ===
Describe the limit counter implementations of sec. 5.3 in McKenney's book (2017 edition), answering:
# List and explain the advantages of each counter.
# Explain the main concepts of each method in your own words (a much-simplified sketch is given below for orientation).
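
For orientation (and emphatically not McKenney's actual code), here is a much-simplified sketch of the limit-counter idea: each thread counts privately up to a quota and flushes to a global count under a lock only when the quota is exhausted. All names here are illustrative.

<pre>
/* limitcnt.c -- simplified limit-counter sketch (not the book's code):
   per-thread local counting bounds contention; the global total is
   only approximate until threads flush.
   Build: gcc -pthread -c limitcnt.c */
#include <pthread.h>

#define QUOTA 1000                      /* per-thread budget before flushing */

static unsigned long globalcount = 0;   /* flushed totals */
static pthread_mutex_t gblcnt_mutex = PTHREAD_MUTEX_INITIALIZER;
static __thread unsigned long counter = 0;  /* this thread's private count */

void add_count(unsigned long delta)
{
    counter += delta;                   /* fast path: no synchronization */
    if (counter >= QUOTA) {             /* slow path: flush under the lock */
        pthread_mutex_lock(&gblcnt_mutex);
        globalcount += counter;
        pthread_mutex_unlock(&gblcnt_mutex);
        counter = 0;
    }
}
</pre>

The real sec. 5.3 designs add, among other refinements, a global pool that bounds the sum of all per-thread budgets so the total never exceeds a hard limit; comparing those refinements is the substance of this homework.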


=== Homework question: statistic counters ===
# In-class assignment (extra 0.1 on the final test): Compare 3 different implementations of statistic counters: an array of per-thread variables, an eventually consistent counter, and GCC __thread variables (a sketch of the __thread variant is given below).
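
Here is a sketch of the GCC <code>__thread</code> variant, under the assumption that exact totals are only needed after the threads are joined; names are illustrative.

<pre>
/* statcnt.c -- statistic-counter sketch with GCC __thread: each thread
   increments its own counter instance (no sharing, no cache-line
   bouncing); the reader sums the registered per-thread slots.
   Build: gcc -pthread statcnt.c && ./a.out */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define INCS 1000000

static __thread unsigned long counter;     /* one instance per thread */
static unsigned long *counterp[NTHREADS];  /* registry for the reader */

static void *worker(void *arg)
{
    counterp[(long)arg] = &counter;        /* register my slot */
    for (int i = 0; i < INCS; i++)
        counter++;                         /* uncontended increment */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    unsigned long sum = 0;
    for (int i = 0; i < NTHREADS; i++)     /* exact after join; read mid-run */
        sum += *counterp[i];               /* it is only eventually consistent */
    printf("total = %lu\n", sum);
    return 0;
}
</pre>

Contrast this with the mutex-protected counter sketched in the Approximate Content section: here the increments stay thread-local and scale with the number of cores.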


=== Project ===
* This project consists of a practical programming problem from the student's research. The student is required to describe a problem from their own research, or from the instructor's suggestions, and present a parallel implementation of some aspect of it.
* The project will be evaluated through actual code and an accompanying monograph.
* Grade: up to 4 extra points on the exam grade.
==== Topic Suggestions ====
For those who haven't found a suitable research topic with their own advisors, here is a list of possible ideas. I can provide some guidance on these during the project, but please let your advisors know about it. Also, not all of the ideas are easy to develop a project on, and some may require thinking on your part in terms of goals and objectives.
 
Try to email the original authors of the relevant papers for your project in order to get code. You have to understand the paper, understand the code, and run the code. You will also have to write code, or a modification of the existing code, although you have limited time for this.
 
If you need any of the files below but can't access them from home, let me know
and I will email you a copy.
 
'''This list is outdated, but still interesting. I have yet to update it for 2017.'''
 
* [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5543612&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F5521877%2F5543135%2F05543612.pdf%3Farnumber%3D5543612 "Design and Implementation of a Wide Area, Large-Scale Camera Network", Kuo et al., 2010]
** Technology: distributed computing / hadoop
** Abstract
  We describe a wide area camera network on a campus setting, the
  SCALLOPSNet (Scalable Large Optical Sensor Network). It covers with
  about 100 stationary cameras an expansive area that can be divided
  into three distinct regions: inside a building, along urban paths,
  and in a remote natural reserve. Some of these regions lack
  connections for power and communications, and, therefore, necessitate
  wireless, battery-powered camera nodes. In our exploration of
  available solutions, we found existing smart cameras to be
  insufficient for this task, and instead designed our own
  battery-powered camera nodes that communicate using 802.11b. The
  camera network uses the Internet Protocol on either wired or wireless
  networks to communicate with our central cluster, which runs cluster
  and cloud computing infrastructure. These frameworks like Apache
  Hadoop are well suited for large distributed and parallel tasks such
  as many computer vision algorithms. We discuss the design and
  implementation details of this network, together with the challenges
  faced in deploying such a large scale network on a research campus.
  We plan to make the datasets available for researchers in the
  computer vision community in the near future.
 
* "Web-Scale Computer Vision using MapReduce for Multimedia Data Mining", White, Yeh, Lin and Davis 2010
** Technology: Mapreduce
** Abstract: This work explores computer vision applications of the MapReduce
  framework that are relevant to the data mining community. An
  overview of MapReduce and common design patterns are provided for
  those with limited MapReduce background. We discuss both the high
  level theory and the low level implementation for several computer
  vision algorithms: classifier training, sliding windows, clustering,
  bag-of-features, background subtraction, and image registration.
  Experimental results for the k-means clustering and single Gaussian
  background subtraction algorithms are performed on a 410 node Hadoop
  cluster.
 
 
* "Building Rome on a Cloudless Day", Frahm et al, 2012
** Technology: Cuda
** Abstract:
  This paper introduces an approach for dense 3D reconstruction from
  unregistered Internet-scale photo collections with about 3 million
  images within the span of a day on a single PC (“cloudless”). Our
  method advances image clustering, stereo, stereo fusion and structure
  from motion to achieve high computational performance. We leverage
  geometric and appearance constraints to obtain a highly parallel
  implementation on modern graphics processors and multi-core
  architectures. This leads to two orders of magnitude higher
  performance on an order of magnitude larger dataset than competing
  state-of-the-art approaches.
 
===== Other interesting works =====
* "GraphLab: A Distributed Framework for Machine Learning in the Cloud", Low etal., 2011
  Machine Learning (ML) techniques are indispensable in
  a wide range of fields. Unfortunately, the exponential in-
  crease of dataset sizes are rapidly extending the runtime
  of sequential algorithms and threatening to slow future
  progress in ML. With the promise of affordable large-
  scale parallel computing, Cloud systems offer a viable
  platform to resolve the computational challenges in ML.
  However, designing and implementing efficient, provably
  correct distributed ML algorithms is often prohibitively
  challenging. To enable ML researchers to easily and effi-
  ciently use parallel systems, we introduced the GraphLab
  abstraction which is designed to represent the computa-
  tional patterns in ML algorithms while permitting effi-
  cient parallel and distributed implementations.
  In this paper we provide a formal description of the
  GraphLab parallel abstraction and present an efficient
  distributed implementation. We conduct a comprehen-
  sive evaluation of GraphLab on three state-of-the-art ML
  algorithms using real large-scale data and a 64 node
  EC2 cluster of 512 processors. We find that GraphLab
  achieves orders of magnitude performance gains over
  Hadoop while performing comparably or superior to
  hand-tuned MPI implementations.
 
* "Parallel Data Mining from Multicore to Cloudy Grids", Fox et. al.
  We describe a suite of data mining tools that cover clustering,
  information retrieval and the mapping of high dimensional data to low dimensions
  for visualization. Preliminary applications are given to particle physics,
  bioinformatics and medical informatics. The data vary in dimension from low (2-
  20), high (thousands) to undefined (sequences with dissimilarities but not vectors
  defined). We use deterministic annealing to provide more robust algorithms that
  are relatively insensitive to local minima. We discuss the algorithm structure and
  their mapping to parallel architectures of different types and look at the
  performance of the algorithms on three classes of system; multicore, cluster and
  Grid using a MapReduce style algorithm. Each approach is suitable in different
  application scenarios. We stress that data analysis/mining of large datasets can be
  a supercomputer application.
* "Fast face tracking using parallel particle filter algorithm", Liu et. al., 2009
 
  This paper proposed a multi-cue based face tracking
  algorithm with the help of parallel multi-core processing.
  Due to illumination and occlusion problems, face tracking
  usually does not work stably based on a single cue. Three
  different visual cues, color histogram, edge orientation
  histogram and wavelet feature, are integrated under the
  framework of particle filter to improve the tracking
  performance considerably. To handle the huge amount of
  computation cost resulted from the introduced multi-cue
  strategy, a map-reduce thread model is designed to parallel
  and speed up the observation steps. Besides, an online
  updating strategy makes our algorithm adaptable to some
  slight face rotations. The experimental results demonstrate
  that our proposed face tracking algorithm works robustly for
  cluttered backgrounds and different illuminations. The
  multi-core parallel scheme achieves a good linear speedup
  compared to the corresponding sequential algorithms.
 
 
* "CudaGIS: Report on the Design and Realization of a Massive Data Parallel GIS on GPUs", Zhang, You,
  We report the design and realization of a high-performance parallel
  GIS, i.e., CudaGIS, based on the General Purpose computing on
  Graphics Processing Units (GPGPU) technologies. Still under active
  developments, CudaGIS currently supports major types of geospatial
  data (point, polyline, polygon and raster) and provides modules for
  spatial indexing, spatial join and other types of geospatial
  operations on such geospatial data types. Experiments have
  demonstrated 20-40X and 1000-10000X speedups over serial CPU
  implementations on main-memory and disk-resident systems,
  respectively.
 
* "A Parallel Clustering Method Study Based on MapReduce", Zhanquan and Fox
 
  Clustering is considered as the most important task in data mining. The goal
  of clustering is to determine the intrinsic grouping in a set of unlabeled
  data. Many practical application problems should be solved with clustering
  method. It has been widely applied into all kinds of areas, such marketing,
  biology, library, insurance, earth-quake study, and World Wide Web and so on.
  Many clustering methods have been studied, such as k-means, Fisher clustering,
  and Koehon clustering and so on. In many kinds of areas, the scale of data set
  becomes larger and larger. Classical clustering method will not work to deal
  with large scale data set. The study of clustering methods based on large
  scale data is considered as an important task. MapReduce is taken as the most
  efficient model to deal with data intensive problems. Many data mining methods
  based on MapReduce have been studied. In this paper, parallel clustering
  method based on MapReduce is studied. The research mainly contributes the
  following aspects.  Firstly, it determines the initial center objectively.
  Secondly, information loss is taken as the distance metric between two
  samples. Thirdly, the clustering results are visualized with interpolation MDS
  method. The efficiency of the method is illustrated with a practical DNA
  clustering problem.
 
* "Sequential and mapreduce-based algorithms for constructing an in-place multidimensional quad-tree index for answering fixed-radius nearest neighbor queries", Andreica and Tapus
 
  Answering fixed-radius nearest neighbor queries constitutes an
  important problem in many areas, ranging from geographic systems to
  similarity searching in object databases (e.g. image and video
  databases). The usual approach in order to efficiently answer such
  queries is to construct an index. In this paper we present algorithms
  for constructing a multidimensional quad-tree index. We start with
  well-known sequential algorithms and then adapt them to the MapReduce
  computation model, in order to be able to handle large amounts of
  data. In all the algorithms the objects are indexed in association
  with quad-tree cells (or nodes) which they intersect (plus possibly a
  few other nearby cells). When processing a query, multiple quad-tree
  cells may be searched in order to find the answer.
 
* "Fast Homing Techniques for Autonomous Robots using Sparse Image Waypoints and Design of Vision-based Indoor Localization as Cloud Computing Services", PhD Thesis, 2011
 
* "Building and Using a Database of One Trillion Natural-Image Patches", Arietta and Lawrence, IEEECGA 2011


==== Previous Years' Projects and Homeworks ====


* '''MPI''': Claudir Oliveira - 25/10/2012 ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/oliveira/apresentacao.pdf slides]) ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/oliveira/codigo/ code])
* '''Hybrid technologies (Distributed + Cuda)''': Ricardo Dias  ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/dias/apresentacao.pdf slides]) ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/dias/codigo/ code])
* '''Cuda/OpenCL''': Joel Sanchez ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/sanchez/gpu_cuda.pdf slides]) ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/sanchez/codigo/ code])
* '''OPENGL/GLSL''' : Luiz Rosalba - 08/11/2012 ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/rosalba/glsl_opengl_apresenta.pdf slides])
* '''mapreduce/hadoop''' : Julio Stutz ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/stutz/mapreduce-hadoop.pdf slides])
* '''Google Go''' : Mateus Guida ([http://www.lems.brown.edu/~rfabbri/stuff/parallel-computing/technologies/guida/google_go.pdf slides])
* '''Programming for the Playstation 3''' parallel architecture: Thiago Luiz


=== Keywords ===
Portuguese: Programação Paralela, Introdução à Computação Paralela


[[Category:Lab Macambira]] [[Category:IPRJ]] [[Category:Redes]]
