Geographically distributed data management to support large-scale data analysis.
Further information
Many companies now store their data in multiple data centers with replication. Data that spans several data centers gives geographically separated customers and employees the fastest possible response time, and it protects the information from loss if a single data center experiences a disaster. However, the amount of data is growing rapidly, which creates challenges for storage, analysis, and other processing tasks. In this paper, we propose and design a geographically distributed data management framework for managing massive data stored and distributed among geo-distributed data centers. The goal of the framework is to enable efficient use of the distributed data blocks for various data analysis tasks. Its architecture consists of a grid of geo-distributed data centers connected to a data controller (DCtrl), which is responsible for organizing and managing the block replicas across the data centers. We use the BDMS system as the system installed in each data center. BDMS stores a big data file as a set of random sample data blocks, each of which is a random sample of the whole data file. DCtrl then distributes these data blocks to multiple data centers with replication. To analyze a big data file distributed under the proposed framework, we randomly select, on any one data center, a sample of the data blocks replicated from other data centers. We use simulation results to demonstrate the performance of the proposed framework in big data analysis across geo-distributed data centers.
(© 2023. Springer Nature Limited.)
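The workflow described in the abstract (random sample data blocks, replicated placement by a controller, and block-level sampling on one data center) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the shuffle-and-split partitioning, the round-robin placement policy, and the data center labels are all assumptions made for illustration only.

```python
import random


def make_random_sample_blocks(records, num_blocks, seed=0):
    """Shuffle the records and deal them into num_blocks blocks.

    Assumption: a plain shuffle-and-split stands in for BDMS, so that each
    block is a random sample of the whole file.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def place_replicas(num_blocks, data_centers, replication_factor):
    """Assign every block to replication_factor distinct data centers.

    Assumption: a hypothetical round-robin policy; the abstract does not
    state how DCtrl chooses replica locations.
    """
    if replication_factor > len(data_centers):
        raise ValueError("replication_factor exceeds number of data centers")
    placement = {dc: [] for dc in data_centers}
    for block_id in range(num_blocks):
        for k in range(replication_factor):
            dc = data_centers[(block_id + k) % len(data_centers)]
            placement[dc].append(block_id)
    return placement


def select_block_sample(placement, data_center, sample_size, seed=0):
    """Randomly pick block ids from those held by one data center."""
    rng = random.Random(seed)
    local_blocks = placement[data_center]
    return rng.sample(local_blocks, min(sample_size, len(local_blocks)))


if __name__ == "__main__":
    records = list(range(1000))  # stand-in for a big data file
    blocks = make_random_sample_blocks(records, num_blocks=10)
    placement = place_replicas(len(blocks), ["dc-a", "dc-b", "dc-c"], 2)
    chosen = select_block_sample(placement, "dc-b", sample_size=3)
    values = [x for block_id in chosen for x in blocks[block_id]]
    # Mean estimated from the sampled blocks versus the full-data mean.
    print(sum(values) / len(values), sum(records) / len(records))
```

Because each block is itself a random sample of the file, a statistic computed on a few locally available blocks approximates the statistic on the whole file, which is what lets the analysis run on a single data center without reading every block.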