Having, Doing, and Being

In-place sorting

기존 배열 내에서 정렬을 진행한다. 즉, 기존 배열 외의 다른 새로운 배열을 필요로 하지 않는다.

예: 삽입 정렬, 선택 정렬

External Sorting, 외부 정렬

메인 메모리(주로 RAM)에 들어가지 않을 정도로 큰 데이터를 정렬하기 위한 방법이다. 따라서 하드 드라이브에 저장되어 사용되는데, 이는 속도가 느려질 수 밖에 없는 방법이다.

Internal Sorting: 메인 메모리 내에 정렬이 필요한 모든 데이터가 들어갈 경우

Hybrid sort-merge strategy

외부 정렬은 주로 hybrid sort-merge strategy 를 사용한다. 우선, 메인 메모리에 들어갈 수 있는 데이터의 일부를 가져온다. sort 단계에서는 읽고, 정렬하고, 임시 파일에 쓴다. merge 단계에서는 정렬된 임시 파일들이 하나의 파일로 합쳐진다.

algorithm

입력: input_file: input.txt output_file: output.txt run_size: 메모리에 들어갈 수 있는 run 파일의 사이즈 num_ways: 합쳐져야 하는 run 파일의 개수 출력: 1. input_file 을 run_size 만큼 읽는다. a. run 파일을 merge sort 방법으로 정렬한다 b. 정렬된 run 을 임시 파일에 i 번째 run 으로 저장한다. c. 모든 run 들이 정렬될 때까지 a, b 를 반복한다. 2. 정렬된 파일들을 합친다.

code ?

code from geeksforgeeks. tested and edited but error occurred.

Critical error detected c0000374

// C++ program to implement external sorting using 
// merge sort 
#include <cstdio>
#include <cstdlib>
#include <time.h>
#define _CRT_SECURE_NO_WARNINGS
using namespace std;

struct MinHeapNode
{
	// The element to be stored 
	int element;

	// index of the array from which the element is taken 
	int i;
};

// Prototype of a utility function to swap two min heap nodes 
void swap(MinHeapNode* x, MinHeapNode* y);

// A class for Min Heap 
class MinHeap
{
	MinHeapNode* harr; // pointer to array of elements in heap 
	int heap_size;	 // size of min heap 

public:
	// Constructor: creates a min heap of given size 
	MinHeap(MinHeapNode a[], int size);

	// to heapify a subtree with root at given index 
	void MinHeapify(int);

	// to get index of left child of node at index i 
	int left(int i) { return (2 * i + 1); }

	// to get index of right child of node at index i 
	int right(int i) { return (2 * i + 2); }

	// to get the root 
	MinHeapNode getMin() { return harr[0]; }

	// to replace root with new node x and heapify() 
	// new root 
	void replaceMin(MinHeapNode x)
	{
		harr[0] = x;
		MinHeapify(0);
	}
};

// Constructor: Builds a heap from a given array a[] 
// of given size 
MinHeap::MinHeap(MinHeapNode a[], int size)
{
	heap_size = size;
	harr = a; // store address of array 
	int i = (heap_size - 1) / 2;
	while (i >= 0)
	{
		MinHeapify(i);
		i--;
	}
}

// A recursive method to heapify a subtree with root 
// at given index. This method assumes that the 
// subtrees are already heapified 
void MinHeap::MinHeapify(int i)
{
	int l = left(i);
	int r = right(i);
	int smallest = i;
	if (l < heap_size && harr[l].element < harr[i].element)
		smallest = l;
	if (r < heap_size && harr[r].element < harr[smallest].element)
		smallest = r;
	if (smallest != i)
	{
		swap(&harr[i], &harr[smallest]);
		MinHeapify(smallest);
	}
}

// A utility function to swap two elements 
void swap(MinHeapNode* x, MinHeapNode* y)
{
	MinHeapNode temp = *x;
	*x = *y;
	*y = temp;
}

// Merges two subarrays of arr[]. 
// First subarray is arr[l..m] 
// Second subarray is arr[m+1..r] 
void merge(int arr[], int l, int m, int r)
{
	int i, j, k;
	int n1 = m - l + 1;
	int n2 = r - m;

	/* create temp arrays */
	int* L = new int(n1);
	int* R = new int(n2 + 1);

	/* Copy data to temp arrays L[] and R[] */
	for (i = 0; i < n1; i++)
		L[i] = arr[l + i];
	for (j = 0; j < n2; j++)
		R[j] = arr[m + 1 + j];

	/* Merge the temp arrays back into arr[l..r]*/
	i = 0; // Initial index of first subarray 
	j = 0; // Initial index of second subarray 
	k = l; // Initial index of merged subarray 
	while (i < n1 && j < n2)
	{
		if (L[i] <= R[j])
			arr[k++] = L[i++];
		else
			arr[k++] = R[j++];
	}

	/* Copy the remaining elements of L[], if there
	are any */
	while (i < n1)
		arr[k++] = L[i++];

	/* Copy the remaining elements of R[], if there
	are any */
	while (j < n2)
		arr[k++] = R[j++];
}

/* l is for left index and r is right index of the
sub-array of arr to be sorted */
void mergeSort(int arr[], int l, int r)
{
	if (l < r)
	{
		// Same as (l+r)/2, but avoids overflow for 
		// large l and h 
		int m = l + (r - l) / 2;

		// Sort first and second halves 
		mergeSort(arr, l, m);
		mergeSort(arr, m + 1, r);

		merge(arr, l, m, r);
	}
}

FILE* openFile(char* fileName, char* mode)
{
	FILE* fp = fopen(fileName, mode);
	if (fp == NULL)
	{
		perror("Error while opening the file.\n");
		exit(EXIT_FAILURE);
	}
	return fp;
}

// Merges k sorted files. Names of files are assumed 
// to be 1, 2, 3, ... k 
void mergeFiles(char* output_file, int n, int k)
{
	FILE** in = new FILE* [k];

	for (int i = 0; i < k; i++)
	{
		char fileName[2];

		// convert i to string 
		snprintf(fileName, sizeof(fileName), "%d", i);

		// Open output files in read mode. 
		in[i] = fopen(fileName, "r");
	}

	// FINAL OUTPUT FILE 
	FILE* out = fopen(output_file, "w");

	// Create a min heap with k heap nodes. Every heap node 
	// has first element of scratch output file 
	MinHeapNode* harr = new MinHeapNode[k];
	int i;
	for (i = 0; i < k; i++)
	{
		// break if no output file is empty and 
		// index i will be no. of input files 
		if (fscanf(in[i], "%d ", &harr[i].element) != 1)
			break;

		harr[i].i = i; // Index of scratch output file 
	}
	MinHeap hp(harr, i); // Create the heap 

	int count = 0;

	// Now one by one get the minimum element from min 
	// heap and replace it with next element. 
	// run till all filled input files reach EOF 
	while (count != i)
	{
		// Get the minimum element and store it in output file 
		MinHeapNode root = hp.getMin();
		fprintf(out, "%d ", root.element);

		// Find the next element that will replace current 
		// root of heap. The next element belongs to same 
		// input file as the current min element. 
		if (fscanf(in[root.i], "%d ", &root.element) != 1)
		{
			root.element = INT_MAX;
			count++;
		}

		// Replace root with next element of input file 
		hp.replaceMin(root);
	}

	// close input and output files 
	for (int i = 0; i < k; i++)
		fclose(in[i]);

	fclose(out);
}

// Using a merge-sort algorithm, create the initial runs 
// and divide them evenly among the output files 
void createInitialRuns(char* input_file, int run_size,
	int num_ways)
{
	// For big input file 
	FILE* in = fopen(input_file, "r");

	// output scratch files 
	FILE** out = new FILE * [num_ways];
	char fileName[2];
	for (int i = 0; i < num_ways; i++)
	{
		// convert i to string 
		snprintf(fileName, sizeof(fileName), "%d", i);

		// Open output files in write mode. 
		out[i] = fopen(fileName, "w");
	}

	// allocate a dynamic array large enough 
	// to accommodate runs of size run_size 
	int* arr = (int*)malloc(run_size * sizeof(int));

	bool more_input = true;
	int next_output_file = 0;

	int i;
	while (more_input)
	{
		// write run_size elements into arr from input file 
		for (i = 0; i < run_size; i++)
		{
			if (fscanf(in, "%d ", &arr[i]) != 1)
			{
				more_input = false;
				break;
			}
		}

		// sort array using merge sort 
		mergeSort(arr, 0, i - 1);

		// write the records to the appropriate scratch output file 
		// can't assume that the loop runs to run_size 
		// since the last run's length may be less than run_size 
		for (int j = 0; j < i; j++)
			fprintf(out[next_output_file], "%d ", arr[j]);

		next_output_file++;
	}

	// close input and output files 
	for (int i = 0; i < num_ways; i++)
		fclose(out[i]);

	fclose(in);
}

// For sorting data stored on disk 
void externalSort(char* input_file, char* output_file,
	int num_ways, int run_size)
{
	// read the input file, create the initial runs, 
	// and assign the runs to the scratch output files 
	createInitialRuns(input_file, run_size, num_ways);

	// Merge the runs using the K-way merging 
	mergeFiles(output_file, run_size, num_ways);
}


// Driver program to test above 
int main()
{
	// No. of Partitions of input file. 
	int num_ways = 10;

	// The size of each partition 
	int run_size = 1000;

	char input_file[] = "input.txt";
	char output_file[] = "output.txt";

	FILE* in = fopen(input_file, "w");

	srand(time(NULL));

	// generate input 
	for (int i = 0; i < num_ways * run_size; i++)
		fprintf(in, "%d ", rand());

	fclose(in);

	externalSort(input_file, output_file, num_ways,
		run_size);

	printf("finished");

	return 0;
}

선행 알고리즘: 합병 정렬

Stable Sorting, 안정 정렬

정렬에서의 안정성은 key 와 value 를 갖는 pair 에서 중요하다. 정렬 알고리즘이 안정성을 갖기 위해서는, key 를 기준으로 정렬할 때 동일한 key 를 갖는 두 pair 가 정렬하기 전과 후에 동일한 순서로 남아있어야 한다.

A 는 배열이고, F는 정렬 알고리즘이라고 하자. < 는 A 의 원소에 대해 strict weak order 를 나타낸다.
i < j 이고 A[i] = A[j] 일 때, F(i) < F(j) 이다.

모든 key 가 다르거나, key 와 쌍을 갖는 value 가 없을 경우는 안정성이 무의미하다.

example

다음과 같이 (이름, 반) 을 쌍으로 갖는 데이터가 있다.

이 때, 이름 순으로 먼저 정렬하고, 그 다음 반 순으로 정렬하려고 한다.

(Dave, A)

(Alice, B)

(Ken, A)

(Eric, B)

(Carol, A)

이름 순 정렬

데이터는 반 끼리 묶여있지 않다.

(Alice, B)

(Carol, A)

(Dave, A)

(Eric, B)

(Ken, A)

이름 순 정렬 이후 반 순 정렬

(Eric, B) 이 (Alice, B) 보다 앞에 있기에 반끼리 묶여있지만, 이름 순으로 정렬되어있지 않다.

(Carol, A)

(Dave, A)

(Ken, A)

(Eric, B)

(Alice, B)

안정 정렬

배열의 원소가 이름 순으로 정렬되어 있고, 반끼리 묶여있으며 반 순으로 정렬되어 있다.

(Carol, A)

(Dave, A)

(Ken, A)

(Alice, B)

(Eric, B)

1번에서의 이름 순으로 정렬된 데이터를 다시 반 순으로 정렬했을 때, 같은 key 가 있을 경우 (ex. Alice 의 B, Eric 의 B), 기존 배열에서 Alice 의 B 가 Eric 의 B보다 앞에 있었기에 이후 정렬된 배열 역시 전자가 앞에 있다. 즉, 배열의 상대적인 위치를 고려하여 안정된 정렬을 만들 수 있다.

정렬 알고리즘과 안정성

안정성을 갖는 정렬 알고리즘:

Stable by nature:

Bubble Sort, Insertion Sort, Merge Sort, Count Sort etc.

Comparison based stable:

Merge Sort, Insertion Sort

인덱스 i, j (i < j) 에 대해 A[j] < A[i] 일 경우에만 A[j] 는 A[i] 보다 먼저 온다. 만약 A[i] = A[j] 일 경우는 A[i] 가 A[j] 보다 먼저 온다.

Non-comparison based stable: ?

Counting Sort

정렬된 배열이 역순으로 채워진다면 동일한 key 를 가진 원소들은 기존과 동일한 상대 위치를 가질 것이다.

Others: ??

Radix Sort

depend on another sort, with the only requirement that the other sort should be stable.

안정성을 갖지 않는 정렬 알고리즘:

Quick Sort, Heap Sort etc.

입력 배열 원소의 순서에 따라 안정성을 가질 수도 있다.

어떤 정렬 알고리즘이든 안정성을 갖도록 할 수 있을까?

그렇다. 우선, 각 알고리즘을 안정성을 갖도록 하는 특정한 방법이 있다. 일반적으로 모든 알고리즘에 적용되는 방법으로서는, (comparison based 정렬 알고리즘에 대해) key 를 비교하는 연산자를 바꾸는 데에 있다. 두 개의 키를 비교할 때, key 가 동일하다면 그 두 원소의 위치도 비교하여 정렬할 수 있다.