DATASTORM (Large-Scale Data Management in Cloud Environments) is an initiative, led by the Information and Decision Support Systems (IDSS) Unit of INESC-ID, for creating a critical mass of scientists and engineers to address the design, implementation and operation of the new wave of large-scale data-intensive software systems.
- 2014-05-05 DataStorm Big Data Summer School
- 2013-07-01 Official project start
Over the last decade a number of complex problems have emerged, involving the computation of predictions and the analysis of large information networks derived from massive data collection systems. Epidemics are modeled based on contact networks, biological systems are modeled based on molecular interaction networks, cultural trends can be inferred from large collections of news text, interactions within social networks can be used to predict emerging product trends, and sensor-collected data can help in tracking and deriving patterns for a myriad of human activities. Given the scale and nature of the data in most problems of this class, the problem becomes, to a large extent, socio-technical: the computational challenges come hand-in-hand with societal challenges, given the possible implications of using knowledge derived from the analyses.
Bearing in mind that such societal challenges will require widely interdisciplinary competences at the international level, the DataStorm project will focus on creating a critical mass of scientists and engineers to address the design, implementation and operation of the new wave of large-scale data-intensive software systems. These systems will collect and integrate data from heterogeneous sources, public and proprietary, from which large and complex graphs can be derived. These graphs can then be mined for patterns, from which models, predictions and various forms of knowledge can be inferred.
A large team has been assembled in a project structure comprising horizontal and vertical work-packages. The horizontal work-packages will address common research challenges associated with the analysis of large datasets of heterogeneous and imprecise network data in general, such as their acquisition and integration, indexing, querying and visualization, along with the proper information lifecycle management processes. Activities in the vertical work-packages involve the use of the researched techniques in application domains in which the members of the research team have been working: an infrastructure for biomolecular data and related information for tackling grand challenges in healthcare for an aging population, a sustainable food supply and protection of the environment; an infrastructure for collecting and managing data for epidemics modeling; a system for tracking social media and deriving entity interactions and the sentiment expressed about them; systems for opinion analysis and prediction of poll results; automatic ontology alignment of biomedical data; information lifecycle management principles in scientific, engineering and corporate business processes (including long-term preservation).
Despite the diversity of topics, there are many common issues in the above projects, spanning not only algorithms, data structures and methodologies, but also privacy and sustainability of open data in shared or proprietary infrastructures. In addition to the advances in basic research on data analysis in general and in specific information domains, it is expected that the highest-impact results of this initiative will come from the cross-fertilization of research, development and advanced training activities, to be articulated under a common framework for the first time within INESC-ID, creating a group with unprecedented critical mass in this domain at the national level.
The research team is interdisciplinary, composed mainly of members of the INESC-ID Research Line in Information and Decision Support Systems, complemented by strategic partners who can provide massive datasets and are interested in the scientific results of this initiative in the long run. The unit already includes diverse competencies, from applied mathematics, machine learning and natural language processing, to information management in large data infrastructures, knowledge integration and text mining. One of the main strategic goals of the research line that will be attained through DataStorm is leveraging cloud storage and computing technology with parallel processing algorithms to address the unprecedented scale and complexity of the information and knowledge management problems to be worked on in the next decade. DataStorm will provide a very competitive technology to expand the existing competencies in data and knowledge mining, enabling the unit to move from processing data on locally managed clusters of servers, as is done today, to leveraging elastic parallel computing services running remotely.
With DataStorm, INESC-ID intends to strengthen key partnerships and maintain its leadership at the national level in massive data analysis, becoming a European reference institution with demonstrated competencies for addressing the petabyte-scale challenges of the upcoming calls in international research programmes. A significant effort will be dedicated to dissemination activities, bringing new researchers to large-scale data management and analysis through the development of advanced training initiatives, including open summer schools and workshops, complementing graduate courses and direct participation in research activities through scholarships.
Horizontal Tasks
H1: Data Acquisition and Information Extraction Pável Calado
H2: Data Representation, Querying and Validation Alexandre Francisco (Ana Freitas in the proposal)
H3: Knowledge Discovery from Heterogeneous Data Ana Teresa Freitas (Arlindo Oliveira in the proposal)
H4: Information Lifecycle Management J. Borbinha
H5: Contexts and Semantics of Information Sofia Pinto (Borbinha in the proposal)
H6: Domain Specific Languages for Large-scale-Data Applications Alberto Silva
Vertical Tasks
V1: Biomedical Data Resources and Data Processing Infrastructure Pedro Monteiro (Ana Freitas in the proposal)
V2: Environmental Data Resources and Data Processing Infrastructure Paulo Carreira
V3: Societal Data Resources and Data Processing Infrastructure Bruno Martins (Alexandre Francisco in the proposal)
V4: Cultural Data Resources and Data Processing Infrastructure Bruno Martins; Nuno Freire until 31 Aug 2013 (Bruno Martins in the proposal)
Coordination Tasks
C1: Dissemination Activities Mário
C2: Project Management Mário J. Silva
DataStorm is funded by FCT.
- Proj #: EXCL/EEI-ESS/0257/2012
- Period: 1 July 2013 to 30 June 2016
- FCT funding: €485,000.00 (INESC+IBET+IICT)
- Support from AMA, PT Comunicações, FCCN
Agência para a Modernização Administrativa I.P. (AMA)
Fundação para a Computação Científica Nacional (FCCN)
Instituto de Biologia Experimental e Tecnológica (IBET)
Instituto de Investigação Científica e Tropical (IICT/MNE)
SAPO - PT Comunicações, SA (PT)