Frequently Asked Questions Regarding Computing in the Center for Theoretical Biological Physics
lim3@rice.edu
Q: How do I apply for a computing account on the supercomputers at Rice?
A: You may use your NetID to apply for a computing account or accounts on the supercomputers at Rice. You can choose to apply for a regular user account (for most students and research staff) or a guest account (for external collaborators and guests not directly affiliated with Rice University).
Remember to supply exactly the required information, especially your account sponsor's information.
For most CTBP users, the sponsor is our Project Manager:
Before you apply for an account, you should determine which supercomputer is best suited for your computing needs. You may learn about Rice supercomputers through Research Computing.
After you submit your application, your sponsor will review it and either approve or reject it. The system administrator of the supercomputers will notify you when your account is ready. To apply for an account, please visit the Research Computing Support Group.
Q: What do I do if I have a computing question or problem?
A: Submit a Helpdesk Ticket with the Research Computing Support Group.
When submitting a ticket, describe your problem as specifically as possible: what the question is about (compiling a program, submitting a job, a computing error, etc.), which supercomputer is involved, how the problem happened, where the input files are, what the error message in the output says, and so on.
To submit a ticket please visit Request Help for Research Computing Resources.
You may also contact Xiaoqin Huang for help.
BRC Room#: 1061
For questions about your desktop/laptop, you may also submit a ticket through the IT Help Desk.
You may also try to help yourself by searching for information online related to your particular program or question, e.g. for questions using GROMACS, these sites are very helpful:
Looking for clues or solutions in several ways will help you become more familiar with the supercomputer and the programs you are using, and therefore help you resolve your questions more efficiently.
Q: What computing resources are available for use at Rice University?
A: For High Performance Computing (HPC), the Research Computing Support Group maintains a collection of shared computing facilities that are available to all Rice-affiliated researchers. There are currently four supercomputers at Rice University:
The documentation for these supercomputers is available at the Research Computing Support Group page.
BlueGene/P: The Rice BlueGene/P is a massively parallel supercomputer featuring 24,576 PowerPC 450 compute cores. Each core is 32-bit and runs at 850MHz.
The system has 4GB of RAM at each node and a total of 260TB of GPFS shared storage.
Parallel jobs are scheduled through LoadLeveler.
DAVinCI: The DAVinCI system is an IBM iDataPlex consisting of 2304 processor cores in 192 Westmere nodes (12 processor cores per node) at 2.83 GHz with 48 GB of RAM per node (4 GB per core). All of the nodes are connected via QDR InfiniBand (40 Gb/s) both to each other and to the GPFS fast scratch storage system.
Sixteen of the nodes have NVIDIA Fermi GPGPUs.
Both parallel and serial jobs can be submitted through PBS.
BlueBioU: Blue BioU is the result of a ground-breaking collaboration between Rice University and IBM that aims to provide large-memory, highly threaded computing to the Texas Medical Center. Standing now at 47 IBM Power 755 nodes, Blue BioU is just under half the size of the famous Watson supercomputer that made news by beating human champions at the television game show Jeopardy!. Each node contains four eight-core POWER7 chips running at 3.86GHz. Each core runs four simultaneous multithreaded hardware threads, giving each node a total of 128 schedulable processor units. BioU sports the largest memory profile of our systems, with 256GB RAM per node, or 8GB per core.
The whole cluster has 6016 threads.
Parallel and serial jobs can be submitted via PBS.
STIC: STIC stands for Shared Tightly-Integrated Cluster. STIC consists of 170 Appro Greenblade E5530 nodes, each with two quad-core 2.4GHz Intel Xeon (Nehalem) CPUs, as well as 44 Appro Greenblade E5650 nodes, each with two six-core 2.6GHz Intel Xeon (Westmere) CPUs. This gives the system a total of 1920 compute cores. A maximum of 720 compute cores is available to all users; this number is subject to change due to special projects, maintenance tasks, and so on. The remaining cores are part of a Research Computing Resort Condominium. Each node has 12GB of memory shared by all cores on the node. All jobs are submitted using SLURM (which is different from PBS).
The shared computing resources at Rice University are updated frequently. Please visit the Research Computing Support Group for the latest information on the HPC clusters and their related documentation.
Q: What can I do after my account is open?
A: Please consider or ask yourself the following:
Q: How do I compile a particular program for my research projects?
A: Coding and compiling are important for computing, so the process below should be followed:
To select the proper libraries, note that there are two series of libraries. One series comes with the compilers, e.g. on DAVinCI the intel64 library corresponds to the Intel compilers at /opt/apps/Intel/2013.1.039/lib/intel64/. The other series is required by a particular program, e.g. FFTW and HDF5, which are commonly used by many software packages.
It is best to use one set of compilers consistently throughout the whole compiling process, and to ensure that all compiling flags are compatible with one another. For more detailed instructions on compiling programs on BlueGene/P, visit Compiling and Running Specific Applications on Blue Gene/P.
Q: How do I find and link application libraries?
A: Each supercomputer has a set of application libraries/tools installed, mostly under the path /opt/apps/. On DAVinCI, for example, the application libraries under /opt/apps include BOOST, FFTW, HDF5, NETCDF, PYTHON3, etc. If you need a library that is not there, you may ask for help building it, or try to build it yourself. To use a library, pass the "-I" flag to add a directory to the header search path and/or the "-L" flag to add a directory to the library search path, e.g. "-I/opt/apps/Intel/2011.0.013/mkl/include/fftw" to use the FFTW of Intel's mkl library, and "-L/opt/apps/Intel/2011.0.013/mkl/lib/intel64" to link the intel64 library. Linking the libraries correctly is critical to compiling a program successfully and correctly, so it is worth doing some research before selecting and compiling the libraries for a particular program.
Q: How do I know that my program runs correctly and efficiently?
A: After the program is compiled, run a typical test case and check that the results are reasonable and correct. A typical test case here means an example whose input file uses the fewest parameters, all of which you are confident about, and which finishes in a short time, e.g. half an hour. To run your program more efficiently, you are advised to benchmark it, especially if it is parallelized, e.g. by using different numbers of CPUs and measuring the time needed to produce a certain amount of output data, or the amount of data generated in a certain amount of time. This site has an example of the performance of NAMD on the BlueGene/P supercomputer: https://docs.rice.edu/confluence/pages/viewpage.action?pageId=36806253.
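A benchmark of the kind described above is usually summarized as speedup and parallel efficiency. The following is a minimal sketch in C, assuming you have already measured the wall-clock time of the same test case on different numbers of CPUs. The function names (speedup, efficiency, print_scaling_table) are hypothetical and not part of any Rice tool.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a parallel run relative to the single-CPU run:
 * S(p) = T(1) / T(p). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: E(p) = S(p) / p. A value near 1.0 means the
 * extra CPUs are well used; a low value means they are mostly idle. */
double efficiency(double t_serial, double t_parallel, int ncpus)
{
    assert(ncpus > 0);
    return speedup(t_serial, t_parallel) / (double)ncpus;
}

/* Print one table row per benchmark run. times[i] is the wall-clock
 * time measured with ncpus[i] CPUs; times[0] must be the 1-CPU run. */
void print_scaling_table(const int *ncpus, const double *times, int nruns)
{
    printf("%6s %10s %8s %10s\n", "CPUs", "time (s)", "speedup", "efficiency");
    for (int i = 0; i < nruns; i++) {
        printf("%6d %10.1f %8.2f %10.2f\n", ncpus[i], times[i],
               speedup(times[0], times[i]),
               efficiency(times[0], times[i], ncpus[i]));
    }
}
```

As an illustration with made-up timings: if a test case takes 1200 seconds on one CPU and 300 seconds on eight CPUs, the speedup is 4 and the efficiency is 0.5, which suggests that eight CPUs is already past the point of good scaling for that input.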
Q: How do I identify possible errors and debug a program?
A: It is usually not easy to identify the source of an error in a program. For parallel jobs, the signals listed here are typical; the causes given are possible explanations, not certainties.
Signal 6 (SIGABRT): the job died immediately after submission with little or nothing in the output file. One possible reason is that the executable image is too large to load.
Signal 7 (SIGBUS): the job terminated unexpectedly with the message "killed with signal 7", which indicates that the program experienced an unhandled alignment error. This error can occur when an improperly memory-aligned data value is accessed.
Signal 9 (SIGKILL): the job was killed from outside the program, possibly because it ran past its allotted time and was killed by the scheduler.
Signal 10: rare on LINUX/UNIX systems. Signal numbering differs between platforms; on those where signal 10 is SIGBUS it indicates a "bus error", which can come from incorrect assembly instructions being issued to the CPU. This error can also happen when using the wrong "bit" compiler, e.g. a 64-bit compiler on a 32-bit platform.
Signal 11 (SIGSEGV): a segmentation fault. In a C/C++ program this usually indicates a pointer that points somewhere it should not, such as a location in memory outside of the program space.
Signal 13 (SIGPIPE): a possible pipe failure, that is, one process is trying to write to another process but there is no process to receive the data. Try exiting all processes and restarting the run to see whether that helps.
To debug a program, add the "-g" flag to your compile options or script so that the compiler records debugging information. After that, one way to debug is to do it manually, i.e. read through the code piece by piece and print out the intermediate output to see where the program goes wrong or stops, then fix the error by modifying or rewriting the code. Another way is to use tools: gdb (at /usr/bin/gdb) and idb (/opt/apps/Intel/2013.1.039/bin/idb on DAVinCI) can set breakpoints to locate the lines of your code with possible errors; valgrind (download from http://valgrind.org/ and install) detects possible memory leaks; totalview (available on DAVinCI at /opt/apps/totalview) is a GUI-based source code defect analysis tool that can be used to debug one or many processes/threads.
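Because signal numbers are not identical on every platform, it can help to look them up by name rather than memorize them. The sketch below is a small C helper, not a Rice-provided tool; the function names are hypothetical. It also assumes the common shell convention that a process killed by signal N is reported with exit status 128 + N, which may or may not be how a given scheduler reports failures.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* request the POSIX signal names */
#endif
#include <assert.h>
#include <signal.h>

/* Return the symbolic name of the signals discussed above, or
 * "unknown" for any other value. The macros from <signal.h> are used
 * instead of hard-coded numbers because some numbers (SIGBUS in
 * particular) differ between platforms. */
const char *signal_name(int signo)
{
    if (signo == SIGABRT) return "SIGABRT";
    if (signo == SIGBUS)  return "SIGBUS";
    if (signo == SIGKILL) return "SIGKILL";
    if (signo == SIGSEGV) return "SIGSEGV";
    if (signo == SIGPIPE) return "SIGPIPE";
    return "unknown";
}

/* Assumption: the exit status follows the shell convention
 * "128 + signal number" for a process killed by a signal.
 * Returns the signal number, or 0 if the status does not look
 * like a signal death. */
int signal_from_exit_status(int status)
{
    assert(status >= 0 && status <= 255);
    return (status > 128) ? status - 128 : 0;
}
```

For example, an exit status of 128 + SIGSEGV reported for a job step would map back to "SIGSEGV" through signal_name(signal_from_exit_status(status)), pointing toward an invalid memory access rather than a scheduler kill.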
Q: What is MPI?
A: MPI, the Message Passing Interface, is a library standard designed specifically for parallel computing. It moves data from the address space of one process to that of another process through cooperative operations on each process, and so provides a means of communication between different processors. An MPI library is installed on all the clusters; use "module avail" to see which versions of MPI are available (e.g. openmpi/1.6.5-intel). To run a parallel program with MPI, use either "mpiexec" or "mpirun" in your job script after "module load openmpi/1.6.5-intel". For more details, see the documentation for each cluster at this site: https://docs.rice.edu/confluence/display/ITDIY/Research+Computing.
To parallelize a code/program, follow six steps:
(1) Include the MPI header file, e.g. include "mpif.h" in Fortran or #include "mpi.h" in C/C++.
(2) Start MPI, i.e. "MPI_Init(&argc, &argv)".
(3) Determine the number of MPI tasks and the ID (rank) of each task, e.g. "MPI_Comm_size(MPI_COMM_WORLD, &nprocs)" and "MPI_Comm_rank(MPI_COMM_WORLD, &myid)".
(4) Send data out to all the computing processes with MPI_Bcast or MPI_Send, e.g. "MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD)", where "&n" is the starting address, "1" is the number of entries, "MPI_INT" is the data type, "0" is the rank of the broadcast root, and "MPI_COMM_WORLD" is the communicator. MPI_Send must be used together with MPI_Recv.
(5) Receive data at each process with "MPI_Recv", or collect data from all the processes with "MPI_Reduce" after computing, e.g. "MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD)", where "&mypi" is the address of the data sent from each process, "&pi" is the address that receives the result, "1" is the number of data items to collect, "MPI_DOUBLE" is the data type, "MPI_SUM" sums the data from all the processes, "0" is the rank of the root to which the summed data goes, and "MPI_COMM_WORLD" is the communicator.
(6) Stop MPI after the calculation is finished, i.e. "MPI_Finalize()".
The key points are: (a) all the MPI tasks have to call "MPI_Init" and "MPI_Finalize", and these two functions can be called only once in the whole code/program, i.e. no MPI calls are allowed outside the region between "MPI_Init" and "MPI_Finalize". This is true for all kinds of program parallelization, no matter how big the program or software package is or how many subroutines it has; (b) the MPI functions for sending, receiving and/or collecting data (e.g. MPI_Send, MPI_Recv, MPI_Reduce) can be used as many times as needed and can be placed anywhere inside the code/program.
A simple example of how to calculate the value of pi using multiple processes is located on DAVinCI at the path /projects/kimba/xh14/From-davinci-scratch/mypi.cc
To learn more about MPI, the following are helpful:
the classes "COMP 322", "COMP 422", and "COMP 522" at Rice Univ.
https://computing.llnl.gov/tutorials/mpi/
https://docs.rice.edu/confluence/display/ITDIY/Research+Computing
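To make the roles of "myid", "nprocs", "mypi" and "pi" in the steps above more concrete, here is a sketch of how a pi calculation is commonly divided among MPI tasks. It is not the contents of mypi.cc, and it deliberately contains no MPI calls so that it compiles with any C compiler: the loop over ranks in reduce_sum_pi is a serial stand-in for what MPI_Reduce with MPI_SUM would do across real processes. The function names and the choice of the midpoint rule are assumptions made for illustration.

```c
#include <assert.h>

/* Partial result computed by one task. Rank `myid` out of `nprocs`
 * takes intervals myid, myid + nprocs, myid + 2*nprocs, ... of the
 * midpoint rule for the integral of 4/(1+x^2) over [0,1], which
 * equals pi. In an MPI program, myid and nprocs would come from
 * MPI_Comm_rank and MPI_Comm_size, and n from MPI_Bcast. */
double partial_pi(int myid, int nprocs, int n)
{
    assert(nprocs > 0 && myid >= 0 && myid < nprocs && n > 0);
    double h = 1.0 / (double)n;   /* width of one interval */
    double sum = 0.0;
    for (int i = myid; i < n; i += nprocs) {
        double x = h * ((double)i + 0.5);   /* interval midpoint */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Serial stand-in for MPI_Reduce with MPI_SUM: add the partial
 * result ("mypi") of every rank into the final value ("pi"). */
double reduce_sum_pi(int nprocs, int n)
{
    double pi = 0.0;
    for (int myid = 0; myid < nprocs; myid++) {
        double mypi = partial_pi(myid, nprocs, n);
        pi += mypi;
    }
    return pi;
}
```

The point of the round-robin split (i = myid, myid + nprocs, ...) is that every interval is handled by exactly one rank, so the sum of the partial results does not depend on how many tasks are used, apart from floating-point rounding.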
Q: What is GPGPU?
A: GPGPU, General Purpose computing on Graphics Processing Units, is a methodology for high performance computing on problems that are highly data parallel and throughput intensive. Highly data parallel means that all processors can simultaneously operate on different data elements, and throughput intensive means that the algorithm processes a large number of data elements, so that there are always plenty of elements to operate on in parallel.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives, and extensions to programming languages, e.g. C and C++ (CUDA C/C++, i.e. the nvcc compiler) and Fortran (the PGI CUDA Fortran compiler from Portland Group, i.e. the pgf compilers). The book "Programming Massively Parallel Processors, second edition: A Hands-on Approach" (by David B. Kirk and Wen-mei W. Hwu from UIUC) explains the theory and concepts of CUDA parallel computing well. The book "CUDA by Example: An Introduction to General-Purpose GPU Programming" (by Jason Sanders and Edward Kandrot) is very good for practicing CUDA programming. The NVIDIA website has a lot of information about GPU calculations (https://developer.nvidia.com/cuda-zone).
GPUs on the DAVinCI cluster at Rice: The DAVinCI system consists of 192 Westmere nodes (12 processor cores per node) with 48GB of RAM per node, and six Sandy Bridge nodes (16 processor cores per node) with 128 GB of RAM per node. Sixteen Westmere nodes of DAVinCI are equipped with NVIDIA Fermi GPGPUs. Each of these 16 nodes has two Tesla M2050 GPU cards, i.e. 32 GPUs in total, and they are designated as the "graphics" queue. RCSG offers an introduction to GPGPU computing on DAVinCI: https://docs.rice.edu/confluence/display/ITDIY/Getting+Started+on+DAVinCI.
Programs with GPGPU Computing: A number of programs/software packages have implemented GPGPU computing, e.g.
AMBER with the PMEMD module (http://ambermd.org/gpus/benchmarks.htm);
GROMACS with a new algorithm targeting SIMD/streaming architectures and accelerated non-bonded force calculations (http://www.gromacs.org/GPU_acceleration);
LAMMPS with user packages (both lib/gpu and lib/cuda) (http://lammps.sandia.gov/doc/Section_accelerate.html);
NAMD with re-coded non-bonded force calculations (http://www.ks.uiuc.edu/Research/gpu/);
others such as MATLAB (http://www.mathworks.com/discovery/matlab-gpu.html), etc.
Tested examples: Some examples of GPGPU computing with NAMD, GROMACS, and LAMMPS are located on DAVinCI at the path /projects/kimba/xh14/From-davinci-scratch/, together with a pdf file about how to compile and how to submit jobs.
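The idea of a "highly data parallel" computation can be illustrated without a GPU. The sketch below is plain C, not CUDA, and the function names are hypothetical. It computes y[i] = a * x[i] + y[i] for every element, with the work for a single element separated into its own function. That per-element function touches only index i, which is the property that would let a CUDA kernel hand each element to a different GPU thread.

```c
#include <assert.h>
#include <stddef.h>

/* Work for ONE data element: y[i] = a * x[i] + y[i].
 * It reads and writes only index i, so no element depends on another.
 * In a CUDA program, a body like this would be run by one GPU thread. */
static void saxpy_element(size_t i, float a, const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

/* CPU reference version: visits the elements one after another.
 * Because the iterations are independent, they could run in any
 * order, or all at the same time on parallel hardware. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        saxpy_element(i, a, x, y);
    }
}
```

A loop whose iterations depend on each other (for example, a running sum where step i needs the result of step i - 1) does not have this property and needs a different algorithm before it can benefit from a GPU.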